Renewable energy sources play an increasingly important role in the global energy mix as efforts to reduce the environmental impact of energy production intensify.
Of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S. Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.
Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable: if component failure can be predicted accurately and the component is replaced before it fails, operation and maintenance costs will be much lower.
The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, brake, etc.).
“ReneWind” is a company working to improve the machinery and processes involved in wind energy production using machine learning, and it has collected sensor data on generator failures in wind turbines. Because the data collected through sensors is confidential (the type of data collected varies by company), they have shared a ciphered version of it. The data has 40 predictors, with 20,000 observations in the training set and 5,000 in the test set.
The objective is to build various classification models, tune them, and find the best one to identify failures so that generators can be repaired before they fail, reducing the overall maintenance cost. The predictions made by the classification model translate as follows:
It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
“1” in the target variable represents “failure” and “0” represents “no failure”.
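Given this cost ordering (inspection < repair < replacement), a missed failure (false negative) is the most expensive outcome, so recall on the failure class is the metric to prioritize. As a minimal sketch with hypothetical unit costs and two hypothetical models (none of these numbers come from the dataset), the trade-off can be quantified directly from a confusion matrix:

```python
# Hypothetical unit costs (assumptions, not from the data):
# inspecting a healthy generator < repairing one about to fail < replacing a failed one.
COST_INSPECT = 1    # false positive: flagged but healthy
COST_REPAIR = 5     # true positive: caught before failure, repaired
COST_REPLACE = 40   # false negative: failure missed, generator replaced

def maintenance_cost(tn, fp, fn, tp):
    """Total maintenance cost implied by a confusion matrix under the assumed unit costs."""
    return fp * COST_INSPECT + tp * COST_REPAIR + fn * COST_REPLACE

# Two hypothetical models evaluated on 1000 turbines with 56 true failures (~5.6%):
# model A favors precision (few false alarms), model B favors recall.
cost_a = maintenance_cost(tn=930, fp=14, fn=26, tp=30)  # recall ~0.54
cost_b = maintenance_cost(tn=894, fp=50, fn=6, tp=50)   # recall ~0.89
print(cost_a, cost_b)  # the high-recall model is cheaper overall
```

Under these assumed costs, tolerating extra inspections is far cheaper than missing failures, which is why model tuning below optimizes for recall rather than accuracy.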
# Install the libraries with the specified versions. I needed to install these individually vs. as a single line to avoid errors.
#!pip install pandas==1.5.3 -q --user #done
#!pip install numpy==1.25.2 -q --user #done
#!pip install matplotlib==3.7.1 -q --user #done
#!pip install seaborn==0.13.1 -q --user #done
#!pip install scikit-learn==1.2.2 -q --user #done
#!pip install imbalanced-learn==0.10.1 -q --user #done
#!pip install xgboost==2.0.3 -q --user #done
#!pip install threadpoolctl==3.3.0 -q --user #done
# Import libraries we need to do our work
# Read and manipulate data
import numpy as np
import pandas as pd
# HTML styling
from IPython.display import display, HTML
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Statistics
from scipy.stats import shapiro
# Model tuning and metrics
from sklearn.model_selection import (
    GridSearchCV,
    train_test_split,
    StratifiedKFold,
    cross_val_score,
)
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay
)
# Impute missing values
from sklearn.impute import SimpleImputer
# Build logistic regression model
from sklearn.linear_model import LogisticRegression
# Over/under sample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Mount Google Drive in Colab to access the data files
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# Define path to data file
path = r'/content/drive/MyDrive/Learning/Data Coursework/PGP-DSBA/6-Model Tuning/Project 6/'
df_train = pd.read_csv(path + 'train.csv')
df_test = pd.read_csv(path + 'test.csv')
print('Training dataset has', df_train.shape[0], 'rows and', df_train.shape[1], 'columns')
Training dataset has 20000 rows and 41 columns
print('Test dataset has', df_test.shape[0], 'rows and', df_test.shape[1], 'columns')
Test dataset has 5000 rows and 41 columns
# Combine the train and test sets into a single EDA dataframe.
# Note: reset_index() without drop=True keeps the original row labels as an 'index' column.
df_eda = pd.concat([df_train, df_test], axis=0).reset_index()
# View information for the combined EDA dataset
print('Combined EDA dataset')
df_eda.info()
Combined EDA dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 42 columns): index and Target are int64; V1-V40 are float64.
All columns have 25000 non-null values except V1 (24977 non-null) and V2 (24976 non-null).
dtypes: float64(40), int64(2)
memory usage: 8.0 MB
# Describe the data. Use include='all' to also summarize categorical series.
df_eda.describe(include='all').T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| index | 25000.0 | 8499.500000 | 6007.060566 | 0.000000 | 3124.750000 | 7499.500000 | 13749.250000 | 19999.000000 |
| V1 | 24977.0 | -0.273121 | 3.446501 | -12.381696 | -2.738531 | -0.749797 | 1.838215 | 15.493002 |
| V2 | 24976.0 | 0.431931 | 3.148527 | -12.319951 | -1.642559 | 0.464706 | 2.527319 | 14.079073 |
| V3 | 25000.0 | 2.498117 | 3.376625 | -10.708139 | 0.226105 | 2.256621 | 4.570447 | 17.090919 |
| V4 | 25000.0 | -0.076310 | 3.428030 | -15.082052 | -2.338280 | -0.137410 | 2.135586 | 13.236381 |
| V5 | 25000.0 | -0.059025 | 2.106000 | -8.603361 | -1.548581 | -0.107352 | 1.340579 | 8.133797 |
| V6 | 25000.0 | -1.004782 | 2.033960 | -10.227147 | -2.351804 | -1.006251 | 0.365115 | 6.975847 |
| V7 | 25000.0 | -0.885044 | 1.763109 | -8.124230 | -2.035858 | -0.920190 | 0.222579 | 8.006091 |
| V8 | 25000.0 | -0.553475 | 3.302968 | -15.657561 | -2.642665 | -0.382091 | 1.721018 | 11.679495 |
| V9 | 25000.0 | -0.007422 | 2.163327 | -8.596313 | -1.485926 | -0.070398 | 1.420234 | 8.850720 |
| V10 | 25000.0 | -0.006693 | 2.183726 | -9.853957 | -1.400021 | 0.110723 | 1.483866 | 8.108472 |
| V11 | 25000.0 | -1.918038 | 3.122172 | -14.832058 | -3.945890 | -1.944540 | 0.102854 | 11.826433 |
| V12 | 25000.0 | 1.599143 | 2.925822 | -12.948007 | -0.408998 | 1.502393 | 3.570925 | 15.080698 |
| V13 | 25000.0 | 1.588880 | 2.876298 | -13.228247 | -0.208349 | 1.652288 | 3.460339 | 15.419616 |
| V14 | 25000.0 | -0.944725 | 1.792426 | -7.813929 | -2.159781 | -0.946795 | 0.270961 | 5.734112 |
| V15 | 25000.0 | -2.422429 | 3.361377 | -16.416606 | -4.432551 | -2.390213 | -0.376356 | 12.246455 |
| V16 | 25000.0 | -2.943880 | 4.230368 | -20.985779 | -5.639162 | -2.700040 | -0.115004 | 13.975843 |
| V17 | 25000.0 | -0.128153 | 3.343630 | -14.091184 | -2.217008 | -0.007411 | 2.075780 | 19.776592 |
| V18 | 25000.0 | 1.190599 | 2.591052 | -12.214016 | -0.404954 | 0.882952 | 2.578819 | 13.642235 |
| V19 | 25000.0 | 1.187544 | 3.394428 | -14.169635 | -1.045095 | 1.281696 | 3.499366 | 13.237742 |
| V20 | 25000.0 | 0.046572 | 3.667234 | -13.922659 | -2.413293 | 0.064728 | 2.517517 | 16.052339 |
| V21 | 25000.0 | -3.621881 | 3.569714 | -17.956231 | -5.933420 | -3.563224 | -1.278000 | 13.840473 |
| V22 | 25000.0 | 0.953860 | 1.649298 | -10.122095 | -0.103713 | 0.976710 | 2.026237 | 7.505291 |
| V23 | 25000.0 | -0.377329 | 4.036824 | -14.866128 | -3.112878 | -0.267060 | 2.442971 | 14.458734 |
| V24 | 25000.0 | 1.125279 | 3.923323 | -16.387147 | -1.497698 | 0.958284 | 3.544675 | 17.806035 |
| V25 | 25000.0 | 0.010498 | 2.015558 | -8.228266 | -1.352614 | 0.036082 | 1.402690 | 8.223389 |
| V26 | 25000.0 | 1.868480 | 3.428152 | -11.834271 | -0.322190 | 1.943616 | 4.136977 | 17.528193 |
| V27 | 25000.0 | -0.600410 | 4.375666 | -14.904939 | -3.653682 | -0.879108 | 2.208137 | 17.560404 |
| V28 | 25000.0 | -0.880110 | 1.919381 | -9.269489 | -2.168125 | -0.898860 | 0.384828 | 7.415659 |
| V29 | 25000.0 | -1.007661 | 2.678917 | -12.579469 | -2.799879 | -1.208240 | 0.608942 | 14.039466 |
| V30 | 25000.0 | -0.036167 | 3.009096 | -14.796047 | -1.895801 | 0.165476 | 2.014656 | 12.505812 |
| V31 | 25000.0 | 0.483236 | 3.458316 | -13.722760 | -1.818319 | 0.489555 | 2.739329 | 17.255090 |
| V32 | 25000.0 | 0.289553 | 5.517512 | -19.876502 | -3.444129 | 0.025922 | 3.759037 | 26.539391 |
| V33 | 25000.0 | 0.023837 | 3.568291 | -16.898353 | -2.260302 | -0.083879 | 2.224481 | 16.692486 |
| V34 | 25000.0 | -0.448694 | 3.180361 | -17.985094 | -2.110809 | -0.238079 | 1.443841 | 14.358213 |
| V35 | 25000.0 | 2.225937 | 2.939321 | -15.349803 | 0.334812 | 2.103187 | 4.056822 | 15.291065 |
| V36 | 25000.0 | 1.530816 | 3.795756 | -14.833178 | -0.929834 | 1.586291 | 4.011064 | 19.329576 |
| V37 | 25000.0 | 0.013639 | 1.787566 | -5.478350 | -1.252857 | -0.122369 | 1.186442 | 7.467006 |
| V38 | 25000.0 | -0.356352 | 3.952311 | -17.375002 | -2.987116 | -0.333115 | 2.283132 | 15.289923 |
| V39 | 25000.0 | 0.900282 | 1.745877 | -6.438880 | -0.259414 | 0.926280 | 2.071508 | 7.759877 |
| V40 | 25000.0 | -0.886985 | 3.005419 | -11.023935 | -2.950211 | -0.940870 | 1.111955 | 10.654265 |
| Target | 25000.0 | 0.055680 | 0.229307 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
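The Target mean of 0.0557 shows that only about 5.6% of observations are failures, a pronounced class imbalance, which is why SMOTE and RandomUnderSampler are imported above. A minimal sketch of quantifying the imbalance, using a hypothetical toy series standing in for `df_eda['Target']`:

```python
import pandas as pd

# Hypothetical stand-in for df_eda['Target']: 25,000 labels, ~5.6% failures
target = pd.Series([1] * 1392 + [0] * 23608, name='Target')

counts = target.value_counts()               # absolute class counts
ratios = target.value_counts(normalize=True) # class proportions
print(counts)
print(ratios.round(4))  # failures are ~5.6% of observations
```

With this ratio, a model that always predicts "no failure" would score ~94% accuracy while catching zero failures, which is another reason recall on the minority class is the metric that matters here.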
# Retrieve and preview the categorical column values, using a list.
# Create numerical and categorical column lists
list_catcol = df_eda.select_dtypes(include=['object']).columns.to_list()
list_numcol = df_eda.select_dtypes(exclude=['object']).columns.to_list()
print('Dataset has', len(list_catcol), 'categorical columns and', len(list_numcol), 'numerical columns.')
# Use a for loop to show the counts and percent contribution of each categorical column.
for col in list_catcol:
    print(f'***** {col} *****')
    print('Preview counts for:', df_eda[col].value_counts(), '\n')
    print('Preview % breakdown for:', df_eda[col].value_counts(normalize=True), '\n')
Dataset has 0 categorical columns and 42 numerical columns.
# Check counts, unique counts and duplicate values for each column. Based on this, we can decide how to approach each variable.
for col in df_eda:
    unique_count = df_eda[col].nunique()
    duplicate_count = df_eda[col].count() - unique_count
    empty_count = df_eda[col].isnull().sum()
    print(f'Series {col} has {unique_count} unique values, {duplicate_count} duplicate values, and {empty_count} missing values.')
Series index has 20000 unique values, 5000 duplicate values, and 0 missing values.
Series V1 has 24977 unique values, 0 duplicate values, and 23 missing values.
Series V2 has 24976 unique values, 0 duplicate values, and 24 missing values.
Series V3 through V40 each have 25000 unique values, 0 duplicate values, and 0 missing values.
Series Target has 2 unique values, 24998 duplicate values, and 0 missing values.
# Check for null values
print(f'EDA dataset has {df_eda.isnull().sum().sum()} null values:')
# Replace any '?' placeholder values with NaNs
df_eda = df_eda.replace("?", np.nan)
# Count NaN values
nan_count = df_eda.isnull().sum().sum()
if nan_count > 0:
    print(f"EDA DataFrame has {nan_count} NaN values.")
else:
    print("EDA DataFrame does not have NaN values.")
EDA dataset has 47 null values:
EDA DataFrame has 47 NaN values.
# Check for duplicate column names
print(df_eda.columns[df_eda.columns.duplicated()])
Index([], dtype='object')
# Preview rows with missing data
df_eda[df_eda.isnull().any(axis=1)]
| | index | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 89 | 89 | NaN | -3.961403 | 2.787804 | -4.712526 | -3.007329 | -1.541245 | -0.881148 | 1.476656 | 0.574700 | ... | -8.326069 | -5.140552 | 1.121314 | -0.305907 | 5.315007 | 3.750044 | -5.631174 | 2.372485 | 2.195956 | 0 |
| 613 | 613 | -2.048681 | NaN | -1.623885 | -3.324224 | 0.152256 | 0.600157 | -1.812802 | 0.852194 | -1.522600 | ... | 3.264218 | 2.379064 | -2.457084 | 1.719365 | 2.537010 | 1.701780 | -1.434535 | 0.597365 | 0.739238 | 0 |
| 2236 | 2236 | -3.760658 | NaN | 0.194954 | -1.637958 | 1.261479 | -1.573947 | -3.685700 | 1.575651 | -0.309823 | ... | 7.620821 | 1.695061 | -3.956354 | 2.707644 | 4.657387 | 1.619307 | -5.537285 | 1.246650 | -1.162793 | 0 |
| 2508 | 2508 | -1.430888 | NaN | 0.659576 | -2.876402 | 1.150137 | -0.785760 | -1.560174 | 2.898635 | -2.346989 | ... | 6.279266 | 3.323914 | -4.047760 | 3.119220 | 3.336260 | 0.603524 | -3.781725 | -0.157478 | 1.503298 | 0 |
| 4653 | 4653 | 5.465769 | NaN | 4.540947 | -2.916550 | 0.399752 | 2.798925 | 0.029477 | -7.334071 | 1.122874 | ... | -1.535753 | 4.596212 | -4.103525 | 4.295524 | 0.152672 | -3.726700 | 6.562692 | 0.706452 | -0.461696 | 0 |
| 5941 | 5941 | NaN | 1.008391 | 1.227702 | 5.397082 | 0.064230 | -2.706919 | -2.028368 | 0.534046 | 3.006797 | ... | 1.869502 | -3.115298 | -0.550197 | 1.713781 | -2.256960 | 0.410992 | -3.434400 | -1.299388 | -1.768734 | 0 |
| 6317 | 6317 | NaN | -5.205346 | 1.997652 | -3.707913 | -1.042200 | -1.593126 | -2.653309 | 0.852280 | -1.310489 | ... | 3.074149 | -0.067649 | -0.277521 | 3.196840 | 7.016205 | 1.302334 | -4.580096 | 2.956254 | -2.363150 | 0 |
| 6464 | 6464 | NaN | 2.146202 | 5.004415 | 4.192063 | 1.427887 | -6.438263 | -0.931339 | 3.794120 | -0.683032 | ... | 5.231472 | -5.113312 | 1.745687 | 2.587189 | 3.990777 | 0.610716 | -4.273457 | 1.864568 | -3.599079 | 0 |
| 6810 | 6810 | -2.631454 | NaN | 2.330188 | 1.090080 | 0.603973 | -1.139383 | -0.690121 | -1.358935 | 0.355568 | ... | -0.950215 | 0.209717 | 0.448728 | 1.046063 | 0.536937 | 0.763187 | 1.728621 | 1.885821 | -1.701774 | 0 |
| 7073 | 7073 | NaN | 2.534010 | 2.762821 | -1.673718 | -1.942214 | -0.029961 | 0.911323 | -3.199743 | 2.948610 | ... | -4.887077 | -2.611526 | -1.500807 | 2.036186 | -0.828979 | -1.369591 | 0.572366 | -0.132183 | -0.322007 | 0 |
| 7788 | 7788 | -4.203459 | NaN | 2.953868 | 0.584466 | 4.103940 | -0.639211 | -2.810799 | -0.112492 | -1.362768 | ... | 12.522374 | 9.502488 | -7.152953 | 5.668769 | 1.249833 | -2.158520 | -0.954461 | -0.002385 | -1.546808 | 0 |
| 8431 | 8431 | NaN | -1.398710 | -2.008106 | -1.750341 | 0.932279 | -1.290327 | -0.270476 | 4.458834 | -2.776270 | ... | 4.560244 | -0.420834 | -2.037313 | 1.109793 | 1.520594 | 2.113872 | -2.252571 | -0.939249 | 2.542411 | 0 |
| 8439 | 8439 | NaN | -3.840585 | 0.197220 | 4.147789 | 1.151400 | -0.993298 | -4.732363 | 0.558966 | -0.926683 | ... | 6.818725 | 3.451213 | 0.241818 | 3.215765 | 1.203210 | 1.274857 | -1.921229 | 0.578890 | -2.837521 | 0 |
| 8483 | 8483 | -4.484232 | NaN | 1.200644 | -2.042064 | 2.779443 | -0.801748 | -5.403548 | -1.225314 | 1.485831 | ... | 9.467401 | 4.281421 | -7.588117 | 3.266825 | 5.232311 | 1.278590 | -5.370513 | 1.984130 | -1.643391 | 0 |
| 8894 | 8894 | 3.263555 | NaN | 8.446574 | -3.253218 | -3.417978 | -2.995838 | -0.669271 | -0.161283 | -0.666870 | ... | -4.242730 | -3.122680 | 2.522415 | 5.283805 | 7.291310 | -0.867555 | -4.315230 | 3.124488 | -2.393239 | 0 |
| 8947 | 8947 | -3.793170 | NaN | 0.719610 | 2.306296 | 0.934728 | -0.984321 | 0.504867 | -0.441008 | -2.767177 | ... | 1.527720 | -0.496910 | 3.789736 | 1.130689 | 0.618278 | -0.111146 | 5.708912 | 1.542366 | -2.481019 | 0 |
| 9362 | 9362 | 2.662045 | NaN | 2.980068 | 4.430762 | -0.237769 | 0.671919 | 0.380068 | -7.646684 | 4.434754 | ... | -5.493590 | -1.104656 | 1.224987 | 0.975596 | -4.794411 | -2.269039 | 7.670648 | 0.824983 | -3.929104 | 0 |
| 9425 | 9425 | -2.354134 | NaN | 2.053893 | 0.811660 | 2.540366 | -0.924875 | -0.208380 | -0.562864 | -0.140210 | ... | -0.621103 | -0.896509 | -1.181480 | -1.236617 | 1.237120 | 1.228277 | 2.073727 | 1.223874 | 1.472175 | 0 |
| 9848 | 9848 | -1.763501 | NaN | 2.845012 | -2.753083 | -0.811848 | -0.101166 | -1.382141 | -1.105042 | -0.054339 | ... | -2.158880 | 1.859682 | -0.337278 | 1.509300 | 3.408411 | 0.922594 | -1.502959 | 2.514666 | -0.793574 | 0 |
| 11156 | 11156 | NaN | -0.666978 | 3.715829 | 4.934000 | 1.667596 | -4.356097 | -2.823137 | 0.373175 | -0.709951 | ... | 6.663446 | -2.897697 | 3.068461 | 2.486862 | 4.808548 | 0.069305 | -1.215784 | 3.013674 | -5.972586 | 0 |
| 11287 | 11287 | NaN | -2.561519 | -0.180836 | -7.194814 | -1.043832 | 1.384845 | 1.306093 | 1.559192 | -2.992173 | ... | -2.531655 | 0.560392 | -1.153884 | -0.019205 | 4.065248 | 0.978880 | -0.571288 | 0.630374 | 3.919467 | 0 |
| 11456 | 11456 | NaN | 1.299595 | 4.382858 | 1.583219 | -0.076564 | 0.658770 | -1.638530 | -4.814763 | -0.914819 | ... | 1.772287 | 5.755242 | 1.203739 | 5.663939 | 0.413630 | -2.643934 | 5.529745 | 2.104536 | -4.945350 | 0 |
| 11637 | 11637 | -2.270541 | NaN | 1.710061 | 1.157522 | -0.355177 | -5.449480 | -0.786321 | 3.936176 | -1.576138 | ... | 2.651480 | -8.429332 | 3.511387 | 1.500102 | 5.552380 | 2.588580 | -3.453418 | 2.324339 | -2.760081 | 0 |
| 12221 | 12221 | NaN | -2.326319 | -0.051978 | 0.615063 | -0.895755 | -2.437003 | 0.349826 | 2.092611 | -2.933523 | ... | 0.134995 | -5.183424 | 5.251667 | 0.716371 | 3.210930 | 1.641985 | 1.543559 | 1.805163 | -2.039510 | 0 |
| 12339 | 12339 | -1.663687 | NaN | -0.712286 | -4.346935 | 1.391670 | -0.093951 | -2.163175 | -0.380573 | 0.031191 | ... | 0.306588 | -2.690990 | -3.111879 | -1.596402 | 5.821108 | 3.462033 | -1.736752 | 2.291092 | 2.240769 | 0 |
| 12447 | 12447 | NaN | 0.752613 | -0.271099 | 1.301204 | 2.038697 | -1.485203 | -0.411939 | 0.980629 | 0.810336 | ... | 4.410397 | -2.208567 | -1.358706 | -1.725697 | 1.679060 | -0.208564 | -2.335547 | 0.112248 | -0.542931 | 0 |
| 13086 | 13086 | NaN | 2.056243 | 3.330642 | 2.741497 | 2.783166 | -0.444191 | -2.015376 | -0.887154 | -1.110920 | ... | 5.112126 | 4.675408 | -1.709632 | 2.429762 | 0.996644 | -1.190509 | 1.207054 | 0.511023 | -0.884200 | 0 |
| 13411 | 13411 | NaN | 2.704511 | 4.587169 | 1.867930 | 2.050133 | -0.925076 | -1.669496 | -1.653803 | -0.243383 | ... | 2.527207 | 3.625279 | -1.200200 | 2.328028 | 1.666937 | -0.943228 | 0.946846 | 1.655145 | -1.665439 | 0 |
| 14202 | 14202 | NaN | 7.038653 | 2.144536 | -3.201788 | 4.112972 | 3.375972 | -1.337179 | -4.546371 | 1.941427 | ... | 0.157778 | 9.768106 | -10.258190 | 0.513864 | -1.974958 | -0.029436 | 3.127486 | 0.009482 | 4.538125 | 0 |
| 15520 | 15520 | NaN | 1.382556 | 3.236896 | -3.818363 | -1.917264 | 0.437686 | 1.347540 | -2.036067 | 1.155712 | ... | -5.414599 | -0.896510 | -1.057864 | 1.417365 | 1.161990 | -1.147123 | -0.048258 | 0.604532 | 0.814557 | 0 |
| 15913 | 15913 | 0.768122 | NaN | 5.296110 | 0.043018 | -1.173729 | -2.248575 | 0.956395 | -0.089941 | -0.241678 | ... | -7.720265 | -4.518617 | 3.182253 | 0.453452 | 2.175494 | 1.261707 | 0.892630 | 2.026732 | 0.632903 | 0 |
| 16576 | 16576 | NaN | 3.933815 | -0.761930 | 2.651889 | 1.753614 | -0.554092 | 1.829107 | -0.105409 | -3.737081 | ... | 3.486408 | 1.028094 | 2.845747 | 1.744060 | -1.999615 | -0.783041 | 8.698449 | 0.352489 | -2.005397 | 0 |
| 18104 | 18104 | NaN | 1.492173 | 2.659206 | 0.222784 | -0.303648 | -1.347322 | 0.044309 | -0.159095 | 1.108116 | ... | -1.007343 | -2.229579 | -0.870845 | 1.299595 | 0.667952 | -0.503349 | -1.485419 | -0.153722 | 0.156501 | 0 |
| 18342 | 18342 | -0.928572 | NaN | 2.375506 | -1.236914 | 3.228744 | -2.100088 | -2.189908 | 0.588644 | 1.955973 | ... | 1.613181 | -1.820569 | -6.664808 | -0.455080 | 3.054891 | 2.935276 | -3.791135 | 0.863011 | 3.335753 | 0 |
| 18343 | 18343 | -2.377369 | NaN | -0.009173 | -1.471979 | 1.295482 | 0.724894 | -1.122797 | -3.190475 | 3.250575 | ... | -5.093149 | 0.439355 | -3.167241 | -2.713266 | -0.592845 | 3.229219 | 1.315635 | 2.282838 | 1.151589 | 0 |
| 18907 | 18907 | -0.119181 | NaN | 3.657612 | -1.231802 | 1.946873 | -0.119089 | 0.652414 | -1.490208 | -0.033631 | ... | -4.670353 | -0.593916 | -1.650592 | -1.405071 | 1.531267 | 1.079147 | 2.832949 | 1.450781 | 3.232659 | 0 |
| 20709 | 709 | 3.171300 | NaN | -0.899604 | -7.687193 | -1.844379 | 2.229502 | 0.649609 | 0.680742 | -0.079613 | ... | -8.569958 | 1.198974 | -3.747194 | -0.834087 | 0.364598 | 3.687177 | -1.450631 | -0.012682 | 6.569833 | 0 |
| 20859 | 859 | NaN | 1.481190 | 2.208128 | -2.550029 | 1.526045 | -0.964918 | 0.559579 | 3.004337 | -3.937734 | ... | 5.093845 | 2.920339 | -3.080601 | 3.750633 | 2.422388 | -0.692277 | -0.182557 | -0.709241 | 2.498946 | 0 |
| 21070 | 1070 | NaN | 1.222743 | 7.023517 | -1.227970 | -3.385548 | -1.500321 | -0.375947 | -2.898488 | 3.016750 | ... | -6.401373 | -2.539566 | -0.427894 | 4.971210 | 1.229448 | -1.620576 | -2.472413 | 0.692331 | -1.427785 | 0 |
| 21639 | 1639 | NaN | -5.280584 | 1.695313 | -0.787160 | -1.872912 | -0.469312 | -1.970327 | -2.099606 | -1.573940 | ... | -1.939828 | -2.575771 | 5.279322 | 1.557176 | 5.542348 | 1.058056 | 1.696663 | 3.691808 | -3.802066 | 0 |
| 21777 | 1777 | 1.255877 | NaN | 1.123121 | 0.347719 | -0.199314 | 0.542522 | -0.904536 | -2.398356 | 0.228689 | ... | 0.851358 | 1.657839 | -1.410919 | 3.587088 | -1.116910 | -0.865736 | 2.766820 | -0.368560 | -0.864084 | 0 |
| 21832 | 1832 | NaN | -0.558554 | 5.315575 | 1.517019 | -2.304035 | -1.410233 | -1.974341 | -3.081827 | 1.762233 | ... | -3.017435 | -0.475546 | 1.987185 | 4.541473 | 1.335494 | -0.812582 | -0.545365 | 1.922588 | -4.117640 | 0 |
| 21869 | 1869 | -1.272832 | NaN | 4.426359 | -3.013970 | -1.294693 | -0.883173 | -1.731633 | 0.098774 | -0.991360 | ... | -0.396983 | 1.190134 | 0.629071 | 2.411258 | 6.166668 | -0.140616 | -4.208798 | 2.623088 | -1.368893 | 0 |
| 22741 | 2741 | -2.938927 | NaN | 2.913242 | 1.431121 | 4.003345 | -4.743048 | -2.450111 | 3.795883 | -0.339877 | ... | 12.077656 | 0.671770 | -6.354040 | 3.887011 | 3.420416 | 0.506994 | -5.913055 | 0.214129 | -0.931294 | 0 |
| 23266 | 3266 | 5.896134 | NaN | 7.342806 | -1.052112 | -1.393952 | -0.410402 | 0.392391 | -6.141263 | 2.100145 | ... | -5.652135 | -1.973205 | 0.275744 | 3.894656 | 2.108591 | -2.803778 | 3.971349 | 2.233942 | -2.542753 | 0 |
| 24051 | 4051 | NaN | 3.983783 | 0.524783 | -4.776552 | 2.590121 | 1.040410 | 3.097642 | -1.744755 | -0.269377 | ... | -4.134022 | -5.444258 | -1.925177 | -5.736453 | 4.155637 | 0.047600 | 3.864513 | 1.224684 | 4.916014 | 0 |
| 24186 | 4186 | 5.034513 | NaN | 4.450708 | -6.077425 | 0.445417 | 2.491588 | 1.958447 | -5.311945 | -1.397204 | ... | -5.133693 | 1.139969 | -1.421471 | 0.822456 | 4.099736 | -2.152178 | 7.063377 | 2.377923 | 1.906096 | 0 |
47 rows × 42 columns
# Preview the data
display(df_eda.head())
| | index | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V32 | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | Target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -4.464606 | -4.679129 | 3.101546 | 0.506130 | -0.221083 | -2.032511 | -2.910870 | 0.050714 | -1.522351 | ... | 3.059700 | -1.690440 | 2.846296 | 2.235198 | 6.667486 | 0.443809 | -2.369169 | 2.950578 | -3.480324 | 0 |
| 1 | 1 | 3.365912 | 3.653381 | 0.909671 | -1.367528 | 0.332016 | 2.358938 | 0.732600 | -4.332135 | 0.565695 | ... | -1.795474 | 3.032780 | -2.467514 | 1.894599 | -2.297780 | -1.731048 | 5.908837 | -0.386345 | 0.616242 | 0 |
| 2 | 2 | -3.831843 | -5.824444 | 0.634031 | -2.418815 | -1.773827 | 1.016824 | -2.098941 | -3.173204 | -2.081860 | ... | -0.257101 | 0.803550 | 4.086219 | 2.292138 | 5.360850 | 0.351993 | 2.940021 | 3.839160 | -4.309402 | 0 |
| 3 | 3 | 1.618098 | 1.888342 | 7.046143 | -1.147285 | 0.083080 | -1.529780 | 0.207309 | -2.493629 | 0.344926 | ... | -3.584425 | -2.577474 | 1.363769 | 0.622714 | 5.550100 | -1.526796 | 0.138853 | 3.101430 | -1.277378 | 0 |
| 4 | 4 | -0.111440 | 3.872488 | -3.758361 | -2.982897 | 3.792714 | 0.544960 | 0.205433 | 4.848994 | -1.854920 | ... | 8.265896 | 6.629213 | -10.068689 | 1.222987 | -3.229763 | 1.686909 | -2.163896 | -3.644622 | 6.510338 | 0 |
5 rows × 42 columns
Out of 25,000 rows of data, 47 have NaN values (missing data). Since we have no business context for these values but know they are all numerical, we impute with the median, a robust measure of center that is not distorted by outliers.
# Fill in NaNs for the purpose of our EDA analysis
df_eda = df_eda.fillna(df_eda.median())
# Recheck for null values
print(f'EDA dataset has {df_eda.isnull().sum().sum()} null values:')
EDA dataset has 0 null values:
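Filling with the pooled median is acceptable for EDA, but for modeling the imputation statistic should be learned from the training split only, so no test-set information leaks into the model. A minimal sketch of that pattern (toy frames with hypothetical values standing in for df_train/df_test; the same idea applies to SimpleImputer fit on the training split):

```python
import pandas as pd

# Toy frames standing in for df_train / df_test (hypothetical values)
train = pd.DataFrame({'V1': [1.0, None, 3.0, 5.0]})
test = pd.DataFrame({'V1': [None, 2.0]})

# Learn the statistic on the training split only...
train_medians = train.median()
# ...then apply that same statistic to both splits
train_filled = train.fillna(train_medians)
test_filled = test.fillna(train_medians)

print(train_medians['V1'])         # median of [1, 3, 5] -> 3.0
print(test_filled['V1'].tolist())  # [3.0, 2.0]
```

The key point is that the test set's own median is never consulted; the test rows are filled with a value derived entirely from training data.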
# Recheck data types
df_eda.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 42 columns): index and Target are int64; V1-V40 are float64.
All 42 columns now have 25000 non-null values.
dtypes: float64(40), int64(2)
memory usage: 8.0 MB
# Function to plot a boxplot and a histogram on the same scale
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    '''
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    '''
    # Create a figure with two subplots
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={'height_ratios': (0.25, 0.75)},  # Ratio of subplot heights
        figsize=figsize,  # Size of the figure
    )
    # Boxplot
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color='blue'
    )
    ax_box2.set_title(f'Boxplot of {feature}')  # Title for the boxplot
    # Histogram
    if bins:
        sns.histplot(
            data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette='winter'
        )
    else:
        sns.histplot(
            data=data, x=feature, kde=kde, ax=ax_hist2
        )
    ax_hist2.axvline(
        data[feature].mean(), color='green', linestyle='--', label='Mean'
    )
    ax_hist2.axvline(
        data[feature].median(), color='black', linestyle='-', label='Median'
    )
    ax_hist2.set_title(f'Histogram of {feature}')  # Title for the histogram
    ax_hist2.legend()  # Add legend to the histogram plot
    # Adjust layout
    plt.tight_layout()
    plt.show()
def calculate_z_scores(values, mean, std_dev):
    '''
    Function to calculate z-scores for an array of values
    '''
    values_array = np.array(values)
    if std_dev == 0:
        raise ValueError("Standard deviation is zero; cannot calculate z-scores.")
    z_scores = (values_array - mean) / std_dev
    return z_scores
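One caveat before this helper is used for normality checks below: z-scores computed against a sample's own mean always average to (approximately) zero regardless of the distribution's shape, so the mean z-score cannot distinguish normal from non-normal data; the Shapiro-Wilk test is the more meaningful check. A quick demonstration of this with the same formula applied to a clearly non-normal sample:

```python
import numpy as np

def calculate_z_scores(values, mean, std_dev):
    """Same z-score formula as the helper above."""
    values_array = np.array(values)
    if std_dev == 0:
        raise ValueError("Standard deviation is zero; cannot calculate z-scores.")
    return (values_array - mean) / std_dev

# A heavily skewed (clearly non-normal) sample
sample = np.array([1.0, 1.0, 1.0, 1.0, 100.0])
z = calculate_z_scores(sample, sample.mean(), sample.std(ddof=1))
print(float(np.mean(z)))  # ~0.0 even though the data is far from normal
```

Because the mean is subtracted from every value, the positive and negative deviations cancel exactly, whatever the underlying distribution.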
# Function to print the statistics for each feature
def print_feature_statistics(data, feature):
    '''
    Show the statistics of a feature.
    data: dataframe
    feature: dataframe column
    '''
    # Chart high-level assessment for each numerical series
    mean = data[feature].mean()
    median = data[feature].median()
    feature_min = data[feature].min()  # avoid shadowing the built-in min()
    feature_max = data[feature].max()  # avoid shadowing the built-in max()
    mode = data[feature].mode()
    std_dev = data[feature].std()
    variance = data[feature].var()
    this_range = feature_max - feature_min
    # Note: z-scores against the sample's own mean average to ~0 by construction,
    # so this check will almost always report "Normal distribution"; the
    # Shapiro-Wilk test below is the stronger signal.
    this_avg_z_score = np.mean(calculate_z_scores(data[feature], mean, std_dev))
    print('\n')
    print(f'***** {feature} *****')
    print(' * Mean =', mean)
    print(' * Median =', median)
    print(' * Min =', feature_min)
    print(' * Max =', feature_max)
    print(' * Range =', feature_min, 'to', feature_max, '(', round(this_range, 3), ')')
    if not mode.empty:
        print(' * Mode =', mode.iloc[0])  # Print only the first mode
    else:
        print(' * Mode = No mode')
    print(' * Standard Deviation =', std_dev)
    print(' * Variance =', variance)
    if this_avg_z_score <= -2 or this_avg_z_score >= 2:
        print(' * Not a normal distribution.', end='')
    else:
        print(' * Normal distribution.', end='')
    # Calculate Shapiro-Wilk to test normal distribution (p <= .05 rejects
    # normality; p-values are unreliable for samples larger than ~5000).
    stat, p = shapiro(data[feature])
    print('\n * Shapiro test statistic:', stat, end='')
    print('\n * P-value:', p, end='')
    if p <= .05:
        print('\n * Has statistical significance.', end='')
    else:
        print('\n * Does not have statistical significance.', end='')
# Analyze critical statistics
for feature in df_eda.columns:
    print_feature_statistics(df_eda, feature)
***** index ***** * Mean = 8499.5 * Median = 7499.5 * Min = 0 * Max = 19999 * Range = 0 to 19999 ( 19999 ) * Mode = 0 * Standard Deviation = 6007.060565789698 * Variance = 36084776.64106564 * Normal distribution. * Shapiro test statistic: 0.9267368219914156 * P-value: 1.0927088805999006e-74 * Has statistical significance. ***** V1 ***** * Mean = -0.27355985398908006 * Median = -0.749796735 * Min = -12.38169567 * Max = 15.49300222 * Range = -12.38169567 to 15.49300222 ( 27.875 ) * Mode = -0.749796735 * Standard Deviation = 3.4449451048104507 * Variance = 11.867646775157487 * Normal distribution. * Shapiro test statistic: 0.9801293012964551 * P-value: 4.591698125099776e-49 * Has statistical significance. ***** V2 ***** * Mean = 0.43196290092623996 * Median = 0.4647059405 * Min = -12.31995112 * Max = 14.07907276 * Range = -12.31995112 to 14.07907276 ( 26.399 ) * Mode = 0.4647059405 * Standard Deviation = 3.147015023135167 * Variance = 9.903703555838435 * Normal distribution. * Shapiro test statistic: 0.9997691075608621 * P-value: 0.006505479449517638 * Has statistical significance. ***** V3 ***** * Mean = 2.4981168801447597 * Median = 2.2566205705 * Min = -10.70813868 * Max = 17.09091852 * Range = -10.70813868 to 17.09091852 ( 27.799 ) * Mode = -10.70813868 * Standard Deviation = 3.376624672165808 * Variance = 11.401594176678849 * Normal distribution. * Shapiro test statistic: 0.9932201942093881 * P-value: 4.309688155100891e-32 * Has statistical significance. ***** V4 ***** * Mean = -0.07630997089336 * Median = -0.1374097615 * Min = -15.08205168 * Max = 13.23638142 * Range = -15.08205168 to 13.23638142 ( 28.318 ) * Mode = -15.08205168 * Standard Deviation = 3.428029797575704 * Variance = 11.751388293066924 * Normal distribution. * Shapiro test statistic: 0.9988826180086048 * P-value: 5.2739697981484764e-12 * Has statistical significance. 
***** V5 ***** * Mean = -0.059025214209920004 * Median = -0.10735211850000001 * Min = -8.603361053 * Max = 8.133797466 * Range = -8.603361053 to 8.133797466 ( 16.737 ) * Mode = -8.603361053 * Standard Deviation = 2.1059999303135273 * Variance = 4.435235706480581 * Normal distribution. * Shapiro test statistic: 0.9988209568518003 * P-value: 1.88125720469789e-12 * Has statistical significance. ***** V6 ***** * Mean = -1.00478202141156 * Median = -1.0062513135 * Min = -10.22714678 * Max = 6.975846518 * Range = -10.22714678 to 6.975846518 ( 17.203 ) * Mode = -10.22714678 * Standard Deviation = 2.0339601472183104 * Variance = 4.136993880472332 * Normal distribution. * Shapiro test statistic: 0.9998282799380676 * P-value: 0.04714963243721654 * Has statistical significance. ***** V7 ***** * Mean = -0.88504404836164 * Median = -0.9201901809999999 * Min = -8.12423002 * Max = 8.006091398 * Range = -8.12423002 to 8.006091398 ( 16.13 ) * Mode = -8.12423002 * Standard Deviation = 1.7631086930119364 * Variance = 3.1085522633742584 * Normal distribution. * Shapiro test statistic: 0.9967209144524218 * P-value: 9.67130481199252e-23 * Has statistical significance. ***** V8 ***** * Mean = -0.55347485404444 * Median = -0.382090539 * Min = -15.65756062 * Max = 11.67949526 * Range = -15.65756062 to 11.67949526 ( 27.337 ) * Mode = -15.65756062 * Standard Deviation = 3.3029681194748086 * Variance = 10.909598398266954 * Normal distribution. * Shapiro test statistic: 0.995586667854101 * P-value: 2.3277277048965216e-26 * Has statistical significance. ***** V9 ***** * Mean = -0.007421937107600006 * Median = -0.070397874 * Min = -8.596313119 * Max = 8.850720458 * Range = -8.596313119 to 8.850720458 ( 17.447 ) * Mode = -8.596313119 * Standard Deviation = 2.1633268304950244 * Variance = 4.679982975539647 * Normal distribution. * Shapiro test statistic: 0.9989722816666619 * P-value: 2.5175398787453112e-11 * Has statistical significance. 
***** V10 ***** * Mean = -0.006693435519920004 * Median = 0.1107225385 * Min = -9.853956957 * Max = 8.108472118 * Range = -9.853956957 to 8.108472118 ( 17.962 ) * Mode = -9.853956957 * Standard Deviation = 2.1837257733969833 * Variance = 4.768658253398253 * Normal distribution. * Shapiro test statistic: 0.9959755383218806 * P-value: 3.3185000634724634e-25 * Has statistical significance. ***** V11 ***** * Mean = -1.9180375490741999 * Median = -1.9445396785 * Min = -14.83205777 * Max = 11.82643333 * Range = -14.83205777 to 11.82643333 ( 26.658 ) * Mode = -14.83205777 * Standard Deviation = 3.1221719100576166 * Variance = 9.747957435952827 * Normal distribution. * Shapiro test statistic: 0.9991493427568736 * P-value: 7.135058611665803e-10 * Has statistical significance. ***** V12 ***** * Mean = 1.5991427326150403 * Median = 1.502392984 * Min = -12.94800683 * Max = 15.08069781 * Range = -12.94800683 to 15.08069781 ( 28.029 ) * Mode = -12.94800683 * Standard Deviation = 2.925822256635705 * Variance = 8.560435877424847 * Normal distribution. * Shapiro test statistic: 0.9988309764671706 * P-value: 2.2191880496890617e-12 * Has statistical significance. ***** V13 ***** * Mean = 1.58888011520688 * Median = 1.652288348 * Min = -13.22824739 * Max = 15.41961638 * Range = -13.22824739 to 15.41961638 ( 28.648 ) * Mode = -13.22824739 * Standard Deviation = 2.876297945525946 * Variance = 8.273089871436778 * Normal distribution. * Shapiro test statistic: 0.9957841521123951 * P-value: 8.772237301411695e-26 * Has statistical significance. ***** V14 ***** * Mean = -0.9447253371601599 * Median = -0.94679462 * Min = -7.813928901 * Max = 5.734111567 * Range = -7.813928901 to 5.734111567 ( 13.548 ) * Mode = -7.813928901 * Standard Deviation = 1.7924264044668183 * Variance = 3.2127924154298464 * Normal distribution. * Shapiro test statistic: 0.9999042660897248 * P-value: 0.4704837275280468 * Does not have statistical significance. 
***** V15 ***** * Mean = -2.42242945357612 * Median = -2.3902134715 * Min = -16.41660621 * Max = 12.24645484 * Range = -16.41660621 to 12.24645484 ( 28.663 ) * Mode = -16.41660621 * Standard Deviation = 3.3613765742319903 * Variance = 11.298852473795591 * Normal distribution. * Shapiro test statistic: 0.9944096460179837 * P-value: 1.922347608839961e-29 * Has statistical significance. ***** V16 ***** * Mean = -2.9438804469343602 * Median = -2.700040316 * Min = -20.98577925 * Max = 13.97584282 * Range = -20.98577925 to 13.97584282 ( 34.962 ) * Mode = -20.98577925 * Standard Deviation = 4.2303683760033275 * Variance = 17.896016596689034 * Normal distribution. * Shapiro test statistic: 0.9964645110642801 * P-value: 1.2454268461070779e-23 * Has statistical significance. ***** V17 ***** * Mean = -0.12815315849512002 * Median = -0.007411486 * Min = -14.09118357 * Max = 19.77659192 * Range = -14.09118357 to 19.77659192 ( 33.868 ) * Mode = -14.09118357 * Standard Deviation = 3.3436297640623014 * Variance = 11.179859999123321 * Normal distribution. * Shapiro test statistic: 0.9959980964357487 * P-value: 3.8938586667588505e-25 * Has statistical significance. ***** V18 ***** * Mean = 1.19059873623828 * Median = 0.8829519619999999 * Min = -12.21401624 * Max = 13.64223452 * Range = -12.21401624 to 13.64223452 ( 25.856 ) * Mode = -12.21401624 * Standard Deviation = 2.591052266327612 * Variance = 6.713551846841453 * Normal distribution. * Shapiro test statistic: 0.9778996219769359 * P-value: 6.026058521619472e-51 * Has statistical significance. ***** V19 ***** * Mean = 1.18754407883628 * Median = 1.2816956145 * Min = -14.16963501 * Max = 13.23774192 * Range = -14.16963501 to 13.23774192 ( 27.407 ) * Mode = -14.16963501 * Standard Deviation = 3.394428147526733 * Variance = 11.52214244872177 * Normal distribution. * Shapiro test statistic: 0.9987150612863137 * P-value: 3.4548109352749365e-13 * Has statistical significance. 
***** V20 ***** * Mean = 0.04657186526047999 * Median = 0.06472784200000001 * Min = -13.92265856 * Max = 16.05233884 * Range = -13.92265856 to 16.05233884 ( 29.975 ) * Mode = -13.92265856 * Standard Deviation = 3.6672336140204 * Variance = 13.448602379801121 * Normal distribution. * Shapiro test statistic: 0.9998837831075257 * P-value: 0.2726962879600423 * Does not have statistical significance. ***** V21 ***** * Mean = -3.6218808510963596 * Median = -3.563223699 * Min = -17.95623124 * Max = 13.8404729 * Range = -17.95623124 to 13.8404729 ( 31.797 ) * Mode = -17.95623124 * Standard Deviation = 3.56971433014682 * Variance = 12.742860398855559 * Normal distribution. * Shapiro test statistic: 0.9981716089884793 * P-value: 1.8723066248804434e-16 * Has statistical significance. ***** V22 ***** * Mean = 0.9538601776590401 * Median = 0.976710345 * Min = -10.12209478 * Max = 7.50529095 * Range = -10.12209478 to 7.50529095 ( 17.627 ) * Mode = -10.12209478 * Standard Deviation = 1.6492983433040131 * Variance = 2.7201850252253625 * Normal distribution. * Shapiro test statistic: 0.9969278012861842 * P-value: 5.511935144611082e-22 * Has statistical significance. ***** V23 ***** * Mean = -0.37732929862083997 * Median = -0.26705985050000003 * Min = -14.86612839 * Max = 14.45873401 * Range = -14.86612839 to 14.45873401 ( 29.325 ) * Mode = -14.86612839 * Standard Deviation = 4.036824300274403 * Variance = 16.295950431285924 * Normal distribution. * Shapiro test statistic: 0.9988574766389917 * P-value: 3.4498766824567415e-12 * Has statistical significance. ***** V24 ***** * Mean = 1.1252793408382002 * Median = 0.9582842105 * Min = -16.3871471 * Max = 17.80603498 * Range = -16.3871471 to 17.80603498 ( 34.193 ) * Mode = -16.3871471 * Standard Deviation = 3.9233230713629808 * Variance = 15.392463922289052 * Normal distribution. * Shapiro test statistic: 0.9956428764319026 * P-value: 3.380177005782839e-26 * Has statistical significance. 
***** V25 ***** * Mean = 0.010498064013999999 * Median = 0.036082446000000004 * Min = -8.228266394 * Max = 8.223388664 * Range = -8.228266394 to 8.223388664 ( 16.452 ) * Mode = -8.228266394 * Standard Deviation = 2.0155584285184576 * Variance = 4.062475778771795 * Normal distribution. * Shapiro test statistic: 0.999407666756791 * P-value: 2.0760540788173595e-07 * Has statistical significance. ***** V26 ***** * Mean = 1.86848039224916 * Median = 1.9436159415 * Min = -11.83427102 * Max = 17.52819276 * Range = -11.83427102 to 17.52819276 ( 29.362 ) * Mode = -11.83427102 * Standard Deviation = 3.428152485328688 * Variance = 11.75222946266526 * Normal distribution. * Shapiro test statistic: 0.9988435636773813 * P-value: 2.7344339436345563e-12 * Has statistical significance. ***** V27 ***** * Mean = -0.6004098035609601 * Median = -0.8791081030000001 * Min = -14.90493915 * Max = 17.56040367 * Range = -14.90493915 to 17.56040367 ( 32.465 ) * Mode = -14.90493915 * Standard Deviation = 4.3756656722805225 * Variance = 19.146450075574155 * Normal distribution. * Shapiro test statistic: 0.9944360999315345 * P-value: 2.2256826408036035e-29 * Has statistical significance. ***** V28 ***** * Mean = -0.88010972003384 * Median = -0.898860059 * Min = -9.269488679 * Max = 7.415659356 * Range = -9.269488679 to 7.415659356 ( 16.685 ) * Mode = -9.269488679 * Standard Deviation = 1.919380914325925 * Variance = 3.6840230942786243 * Normal distribution. * Shapiro test statistic: 0.9997599753632027 * P-value: 0.004803981209627207 * Has statistical significance. ***** V29 ***** * Mean = -1.00766110468488 * Median = -1.208240071 * Min = -12.57946901 * Max = 14.0394656 * Range = -12.57946901 to 14.0394656 ( 26.619 ) * Mode = -12.57946901 * Standard Deviation = 2.678917022058477 * Variance = 7.17659641107466 * Normal distribution. * Shapiro test statistic: 0.9915985315249529 * P-value: 3.531487195727506e-35 * Has statistical significance. 
***** V30 ***** * Mean = -0.03616721185727999 * Median = 0.1654760285 * Min = -14.79604656 * Max = 12.50581239 * Range = -14.79604656 to 12.50581239 ( 27.302 ) * Mode = -14.79604656 * Standard Deviation = 3.009096232662127 * Variance = 9.054660137421404 * Normal distribution. * Shapiro test statistic: 0.992967023738028 * P-value: 1.3102140657834615e-32 * Has statistical significance. ***** V31 ***** * Mean = 0.48323592443684 * Median = 0.4895547575 * Min = -13.72275956 * Max = 17.25509041 * Range = -13.72275956 to 17.25509041 ( 30.978 ) * Mode = -13.72275956 * Standard Deviation = 3.4583156909552346 * Variance = 11.95994741830718 * Normal distribution. * Shapiro test statistic: 0.9994735573312539 * P-value: 1.073310613967571e-06 * Has statistical significance. ***** V32 ***** * Mean = 0.28955294566932 * Median = 0.0259220855 * Min = -19.87650214 * Max = 26.53939068 * Range = -19.87650214 to 26.53939068 ( 46.416 ) * Mode = -19.87650214 * Standard Deviation = 5.51751223998585 * Variance = 30.442941318393668 * Normal distribution. * Shapiro test statistic: 0.9962946146393918 * P-value: 3.397244789948725e-24 * Has statistical significance. ***** V33 ***** * Mean = 0.023837261959680004 * Median = -0.08387902950000001 * Min = -16.8983529 * Max = 16.69248639 * Range = -16.8983529 to 16.69248639 ( 33.591 ) * Mode = -16.8983529 * Standard Deviation = 3.568290798297324 * Variance = 12.732699221213354 * Normal distribution. * Shapiro test statistic: 0.9952804877613421 * P-value: 3.2397580188697404e-27 * Has statistical significance. ***** V34 ***** * Mean = -0.44869444283312004 * Median = -0.2380793395 * Min = -17.98509378 * Max = 14.3582134 * Range = -17.98509378 to 14.3582134 ( 32.343 ) * Mode = -17.98509378 * Standard Deviation = 3.180361064566917 * Variance = 10.114696501013215 * Normal distribution. * Shapiro test statistic: 0.9825169891434745 * P-value: 7.564417566516209e-47 * Has statistical significance. 
***** V35 ***** * Mean = 2.225936923212 * Median = 2.1031865444999998 * Min = -15.34980284 * Max = 15.29106465 * Range = -15.34980284 to 15.29106465 ( 30.641 ) * Mode = -15.34980284 * Standard Deviation = 2.9393205650200116 * Variance = 8.639605383949561 * Normal distribution. * Shapiro test statistic: 0.9942454854786105 * P-value: 7.834303119977765e-30 * Has statistical significance. ***** V36 ***** * Mean = 1.53081599173408 * Median = 1.586290521 * Min = -14.83317783 * Max = 19.3295757 * Range = -14.83317783 to 19.3295757 ( 34.163 ) * Mode = -14.83317783 * Standard Deviation = 3.795756219085477 * Variance = 14.407765274726074 * Normal distribution. * Shapiro test statistic: 0.9989497231765267 * P-value: 1.6862086009386057e-11 * Has statistical significance. ***** V37 ***** * Mean = 0.013638774801759998 * Median = -0.1223687705 * Min = -5.47835049 * Max = 7.467006174 * Range = -5.47835049 to 7.467006174 ( 12.945 ) * Mode = -5.47835049 * Standard Deviation = 1.7875664714847763 * Variance = 3.1953938899765335 * Normal distribution. * Shapiro test statistic: 0.9932815623400294 * P-value: 5.781310925724801e-32 * Has statistical significance. ***** V38 ***** * Mean = -0.35635212548096 * Median = -0.33311470600000004 * Min = -17.37500188 * Max = 15.28992265 * Range = -17.37500188 to 15.28992265 ( 32.665 ) * Mode = -17.37500188 * Standard Deviation = 3.952310981229185 * Variance = 15.620762092344803 * Normal distribution. * Shapiro test statistic: 0.9998405321934706 * P-value: 0.07070570006012407 * Does not have statistical significance. ***** V39 ***** * Mean = 0.9002824334052 * Median = 0.9262801039999999 * Min = -6.438880356 * Max = 7.759876885 * Range = -6.438880356 to 7.759876885 ( 14.199 ) * Mode = -6.438880356 * Standard Deviation = 1.7458766993074082 * Variance = 3.04808544918453 * Normal distribution. * Shapiro test statistic: 0.9996761798845495 * P-value: 0.00032582666594564084 * Has statistical significance. 
***** V40 ***** * Mean = -0.88698525381304 * Median = -0.9408704345000001 * Min = -11.02393546 * Max = 10.65426542 * Range = -11.02393546 to 10.65426542 ( 21.678 ) * Mode = -11.02393546 * Standard Deviation = 3.0054194915687313 * Variance = 9.032546320301252 * Normal distribution. * Shapiro test statistic: 0.9993248552075727 * P-value: 2.980300397548988e-08 * Has statistical significance. ***** Target ***** * Mean = 0.05568 * Median = 0.0 * Min = 0 * Max = 1 * Range = 0 to 1 ( 1 ) * Mode = 0 * Standard Deviation = 0.22930730662941154 * Variance = 0.05258184087363496 * Normal distribution. * Shapiro test statistic: 0.23995223932959198 * P-value: 1.8743435544527634e-134 * Has statistical significance.
# Drop index for this purpose
df_eda = df_eda.drop(['index'], axis=1)
# Analyze features
for feature in df_eda.columns:
    histogram_boxplot(df_eda, feature, figsize=(12, 7), kde=False, bins=None)
# Examine any correlations and highlight anything > .75
corr_matrix = df_eda.corr()
# Create a mask for any correlations <= 0.75
mask = np.abs(corr_matrix) <= 0.75 # to handle both positive and negative correlations
# Define color palette
cmap = sns.diverging_palette(220, 20, as_cmap=True)
plt.figure(figsize=(20, 12))
# Plot heatmap with the mask
ax = sns.heatmap(corr_matrix, mask=mask, cmap=cmap, center=0, annot=True, fmt=".1f", linewidths=.5, vmin=-1, vmax=1)
# Rotate y-axis labels
plt.yticks(rotation=0)
plt.title('Strong Correlation Heatmap (+/- .75)')
plt.show()
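Reading pairs off a dense heatmap is error-prone, so a tabular companion can help. The sketch below lists every pair whose absolute correlation exceeds the same 0.75 threshold; it uses small synthetic columns (`f1`–`f3` are invented names, not the actual V* features):

```python
import numpy as np
import pandas as pd

# Synthetic data: f2 is built from f1, so that pair is strongly correlated
rng = np.random.default_rng(1)
a = rng.normal(size=500)
demo = pd.DataFrame({
    'f1': a,
    'f2': 0.9 * a + rng.normal(scale=0.1, size=500),
    'f3': rng.normal(size=500),
})

corr = demo.corr()
# Keep the upper triangle only so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
stacked = upper.stack()
strong_pairs = stacked[stacked.abs() > 0.75]
print(strong_pairs)  # only the (f1, f2) pair survives the threshold
```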
OBSERVATIONS
Values
Most common target value is 0.
Target variable is imbalanced. We will need to address this later during modeling.
Nearly all features outside of target have a few outliers.
Series V1 has 24977 unique values, 0 duplicate values, and 23 missing values.
Series V2 has 24976 unique values, 0 duplicate values, and 24 missing values.
Series V3 to V40 all have 25000 unique values, 0 duplicate values, and 24 missing values.
Series Target has 2 unique values, 24998 duplicate values, and 0 missing values.
Statistical Significance and Distribution
Correlation
Next, let's visualize some of the other (non-Target) correlations we detected earlier. All of the following items had correlations higher than +/-.75 with other variables.
In addition, we'll also visualize V15, V21 and V18 with regard to our Target variable.
# Create boxplots of important features against the target variable
cols = df_eda[['V2', 'V3', 'V7', 'V8', 'V9', 'V11', 'V14', 'V15', 'V16', 'V18', 'V19', 'V21', 'V24', 'V25', 'V32', 'V27', 'V30', 'V38', 'V34']].columns.tolist()
plt.figure(figsize=(15, 10))
# Loop through each important feature
for i, variable in enumerate(cols):
    plt.subplot(4, 5, i + 1)
    sns.boxplot(data=df_eda, x='Target', y=variable, showfliers=False, hue='Target')
    plt.ylim(-3, 3)
    plt.tight_layout()
    plt.legend().set_visible(False)
    plt.title(variable)
plt.show()
%%time
# Create pairplots for the same features; time this as it takes very long to run
# Ensure 'Target' is included
df_eda_pairplot = df_eda[cols + ['Target']]
# Create pairplot
sns.pairplot(df_eda_pairplot, hue='Target', diag_kind='kde')
# Show the plot
plt.show()
NOTE The correlation heatmap is very complex, and renders fine in the notebook but will not export to HTML. Therefore, I have also included this as a screenshot image below for visibility.
OBSERVATIONS
We can observe that some of the V variables have distinctly diagonal plots, indicating strong positive or negative correlations. These confirm the correlations we noted earlier for the various pairs.
Before we can build the model, we need to perform some data processing to prepare the raw data.
We will also define the functions that we'll use later for model building and construction.
Before we begin, let's revisit the goal of this project:
GOAL: Predict generator failures in order to minimize overall maintenance costs through proactive inspections or repairs vs. full replacements.
Null Hypothesis:
Alternative Hypothesis:
The goal of our model is to identify which variables most accurately predict generator failures, and produce a tuned model that can most accurately predict the most expensive failures.
False negatives (FN): the failure occurred but was not predicted, so the generator must be replaced (highest cost).
True positives (TP): the failure was predicted accurately, and the company must spend money on repairs.
False positives (FP): the model predicted a failure, but none occurred, so only an inspection cost is incurred.
For efficiency, let's define some useful functions to apply throughout model building.
# Libraries to build linear models for statistical analysis and prediction
import statsmodels.stats.api as sms
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
from statsmodels.tools.tools import add_constant
# Libraries to aid in scoring
from sklearn import metrics
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)
# Libraries for data splitting and cross-validation
from sklearn.model_selection import cross_val_score, train_test_split, StratifiedKFold
# Libraries to import decision tree and linear classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
# Libraries for ensemble modeling (bagging, boosting, stacking)
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
# Libraries for model tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import Ridge, Lasso
# Other helpful libraries
import math
from itertools import product
from collections import Counter
# Function to compute metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    '''
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    '''
    # Predict using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)         # Accuracy
    recall = recall_score(target, pred)        # Recall
    precision = precision_score(target, pred)  # Precision
    f1 = f1_score(target, pred)                # F1-score
    # Create a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            'Accuracy': acc,
            'Recall': recall,
            'Precision': precision,
            'F1': f1,
        },
        index=[0],
    )
    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    '''
    Function to plot a confusion matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    '''
    # Predict using independent variables
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ['{0:0.0f}'.format(item) + '\n{0:.2%}'.format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt='')
    plt.ylabel('True')
    plt.xlabel('Predicted')
# Function to show number of rows and columns
def print_dataframe_shape(df_name, df):
    '''
    Function to print number of rows and columns
    '''
    print(f'Number of rows and columns for {df_name}:')
    print(df.shape)
Next, we want to define which type of scorer to use for cross-validation and hyperparameter tuning. But, how should we decide which metrics to target?
As noted in the project: It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.
Based on this, and because the model should excel at preventing false negatives (which carry the highest cost implication), we should maximize Recall.
Recall seeks to minimize false negatives by predicting which generators will fail. Then, they can be inspected or repaired ahead of full replacement, for less cost than a full replacement.
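To make the cost ordering concrete, here is a toy expected-cost comparison. The dollar figures are invented purely for illustration (the brief only gives the ordering: inspection < repair < replacement):

```python
# Hypothetical per-event costs; only their ordering is given in the project brief
COST_INSPECTION = 100
COST_REPAIR = 1_000
COST_REPLACEMENT = 10_000

def maintenance_cost(tp, fp, fn):
    '''Total maintenance cost for a batch of predictions.
    TP: predicted failure -> inspect and repair before it breaks
    FP: false alarm       -> inspection cost only
    FN: missed failure    -> full replacement
    '''
    return (tp * (COST_INSPECTION + COST_REPAIR)
            + fp * COST_INSPECTION
            + fn * COST_REPLACEMENT)

# Model A: higher recall (fewer misses) at the price of more false alarms
print(maintenance_cost(tp=90, fp=50, fn=10))  # 204000
# Model B: fewer false alarms, but it misses more real failures
print(maintenance_cost(tp=70, fp=10, fn=30))  # 378000
```

Even with a generous false-alarm rate, the high-recall model wins under these assumed costs, which is why Recall is the metric to maximize.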
Next, let's specify Recall as our scorer metric during cross-validation and hyperparameter tuning.
# Initiate scorer
scorer = metrics.make_scorer(metrics.recall_score)
Before generating any models, we need to perform data cleanup.
We know at this point that all of our features are numerical and are nearly all unique values. We will not need to perform one-hot encoding or dummy variable creation as part of feature engineering.
Our target variable is an integer and is represented as 0 (No Failure) or 1 (Failure). This will become our y variable.
Next, we will follow these steps to complete our preprocessing:
# Create a copy of the original datasets
# These were already split up for us.
df_train_backup = df_train.copy()
df_test_backup = df_test.copy()
# Revisit counts of each existing dataset
print_dataframe_shape('df_train', df_train)
print_dataframe_shape('df_test', df_test)
Number of rows and columns for df_train: (20000, 41) Number of rows and columns for df_test: (5000, 41)
# Drop the target variable from the raw datasets
# Training data
X_train_raw = df_train.drop(['Target'], axis=1)
y_train_raw = df_train['Target']
# Test data
X_test_raw = df_test.drop(['Target'], axis=1)
y_test_raw = df_test['Target']
# Split the training dataset into two: training and validation sets, with a 75/25 split
X_train, X_validation, y_train, y_validation = train_test_split(
    X_train_raw, y_train_raw, test_size=0.25, random_state=1, stratify=y_train_raw
)
# New test datasets we'll use for our models, based on the original raw data
X_test = X_test_raw
y_test = y_test_raw
# New train and validation data
print_dataframe_shape('X_train', X_train)
print_dataframe_shape('y_train', y_train)
print_dataframe_shape('X_validation', X_validation)
print_dataframe_shape('y_validation', y_validation)
# Original test data
print_dataframe_shape('X_test', X_test)
print_dataframe_shape('y_test', y_test)
Number of rows and columns for X_train: (15000, 40) Number of rows and columns for y_train: (15000,) Number of rows and columns for X_validation: (5000, 40) Number of rows and columns for y_validation: (5000,) Number of rows and columns for X_test: (5000, 40) Number of rows and columns for y_test: (5000,)
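Since the target is imbalanced (~5.6% failures), `stratify` matters in the split above. A minimal check on synthetic labels (invented to mimic the imbalance) shows the class ratio is preserved in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 56 positives out of 1000, mimicking the ~5.6% failure rate
y = np.array([1] * 56 + [0] * 944)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)
print(y_tr.mean(), y_val.mean())  # both 0.056
```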
# Find nulls in training data
for col in X_train.columns:
    if X_train[col].isnull().sum() > 0:
        print('Missing value count for', col, ':', X_train[col].isnull().sum())
        print(' Min:', X_train[col].min())
        print(' Max:', X_train[col].max())
        print(' Range:', X_train[col].max() - X_train[col].min())
        print(' Mean:', X_train[col].mean())
        print(' Median:', X_train[col].median())
Missing value count for V1 : 15 Min: -11.69148362 Max: 13.59449207 Range: 25.28597569 Mean: -0.2863742265735069 Median: -0.765363968 Missing value count for V2 : 14 Min: -12.31995112 Max: 13.08926877 Range: 25.409219890000003 Mean: 0.4407825706436007 Median: 0.46865727199999996
# Find nulls in validation data
for col in X_validation.columns:
    if X_validation[col].isnull().sum() > 0:
        print('Missing value count for', col, ':', X_validation[col].isnull().sum())
        print(' Min:', X_validation[col].min())
        print(' Max:', X_validation[col].max())
        print(' Range:', X_validation[col].max() - X_validation[col].min())
        print(' Mean:', X_validation[col].mean())
        print(' Median:', X_validation[col].median())
Missing value count for V1 : 3 Min: -11.87645069 Max: 15.49300222 Range: 27.36945291 Mean: -0.22887998734900936 Median: -0.707505303 Missing value count for V2 : 4 Min: -10.46760681 Max: 11.71257436 Range: 22.180181169999997 Mean: 0.4393718537067655 Median: 0.480740976
# Find nulls in test data
for col in X_test.columns:
    if X_test[col].isnull().sum() > 0:
        print('Missing value count for', col, ':', X_test[col].isnull().sum())
        print(' Min:', X_test[col].min())
        print(' Max:', X_test[col].max())
        print(' Range:', X_test[col].max() - X_test[col].min())
        print(' Mean:', X_test[col].mean())
        print(' Median:', X_test[col].median())
Missing value count for V1 : 5 Min: -12.38169567 Max: 13.50435159 Range: 25.886047259999998 Mean: -0.2776216101771772 Median: -0.764766844 Missing value count for V2 : 6 Min: -10.71617927 Max: 14.07907276 Range: 24.79525203 Mean: 0.3979275522226672 Median: 0.4273690785
We need to address nulls / NaN for series V1 and V2.
Since we know that all of our features are numerical, we can use scikit-learn's SimpleImputer: fit it on the training data only, then transform each dataset. This will prevent data leakage.
Because we don't have any categorical values, we will not need to generate any dummy variables.
# Instantiate SimpleImputer for imputing the missing values with the median
imp_median = SimpleImputer(missing_values=np.nan, strategy='median')
# Fit on the training set only, then transform each dataset, filling missing values with the median (50th percentile)
X_train[:] = imp_median.fit_transform(X_train)
X_validation[:] = imp_median.transform(X_validation)
X_test[:] = imp_median.transform(X_test)
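An equivalent, arguably safer pattern is to put the imputer inside a scikit-learn `Pipeline`, so fit/transform can never be mixed up between splits. A tiny sketch on made-up data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up data with one NaN in each feature column
X = np.array([[1.0, 2.0], [np.nan, 3.0], [3.0, np.nan], [4.0, 5.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('clf', LogisticRegression()),
])
pipe.fit(X, y)  # imputer statistics are learned from the training data only
print(pipe.named_steps['impute'].statistics_)  # [3. 3.] -- per-column medians
```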
# Re-check for nulls
j = 0
# Find nulls in training data
for col in X_train.columns:
    if X_train[col].isnull().sum() > 0:
        print('Training: Missing value count for', col, ':', X_train[col].isnull().sum())
        j = j + X_train[col].isnull().sum()
# Find nulls in validation data
for col in X_validation.columns:
    if X_validation[col].isnull().sum() > 0:
        print('Validation: Missing value count for', col, ':', X_validation[col].isnull().sum())
        j = j + X_validation[col].isnull().sum()
# Find nulls in test data
for col in X_test.columns:
    if X_test[col].isnull().sum() > 0:
        print('Test: Missing value count for', col, ':', X_test[col].isnull().sum())
        j = j + X_test[col].isnull().sum()
print(f'Finished checking datasets. {j} NaN values found.')
Finished checking datasets. 0 NaN values found.
Earlier, we detected that we don't have a lot of outliers and that most of our data is fairly normally distributed. We also do not have any business context of the various encoded V* variables.
For these reasons, we will not remove any outliers.
This dataset has a lot of variables, and high dimensionality will impact our modeling. Normally, we would remove variables that have no impact on the target variable. But in this case, the variables are encoded and we lack business context, so removing any of the data would be risky.
That said, earlier using corr(), we found that three variables have measurable impact on our target variable:
While the other V* series have some cross-impact, they have little to no correlation with generator failure.
Once we create our final model, we will evaluate feature importance as a whole so that we can assess which ones actually impact our model. In future runs, we could perhaps use that information and discuss with the business the possibility of removing some of the variables from our analysis.
No other good candidates exist for feature engineering, as this is a dataset of abstract, numerical values. We do not need to generate dummy variables or convert categorical variables to measurable numeric ones.
Now that our dataset has been split and cleaned, we can start model generation.
To restate the goal: identify which variables most accurately predict generator failures, and produce a tuned model that best predicts the most expensive failures.
Let's run a performance check on the model prior to any tuning. We will also store the results so we can reference the improvements later.
# Dictionary of baseline (Run 0) models
model_0_dict = {
    '0_Bagging': BaggingClassifier(random_state=1),
    '0_RandomForest': RandomForestClassifier(random_state=1),
    '0_GradientBoost': GradientBoostingClassifier(random_state=1),
    '0_AdaBoost': AdaBoostClassifier(random_state=1),
    '0_XGBoost': XGBClassifier(random_state=1, eval_metric='logloss'),
    '0_DecisionTree': DecisionTreeClassifier(random_state=1),
    '0_LogisticRegression': LogisticRegression(random_state=1),
}
# Train each of the models in our model list
# Time the run
%%time
# Create empty lists to store names and results for Run 0
models_0 = model_0_dict.items()
names_0 = []
results_0 = []
print('\nModel Run 0: Cross-Validation performance on training data:')
# Loop through the model list to compute mean cross-validation scores
for name, model in models_0:
kfold = StratifiedKFold(
n_splits=5
, shuffle=True
, random_state=1
)
cv_result = cross_val_score(
estimator=model
, X=X_train
, y=y_train
, scoring = scorer
,cv=kfold
)
results_0.append(cv_result)
names_0.append(name)
print('{}: {}'.format(name, cv_result.mean()))
print('\nModel Run 0: Performance on validation data:')
for name, model in models_0:
model.fit(X_train, y_train)
scores = recall_score(y_validation, model.predict(X_validation))
print('{}: {}'.format(name, scores))
Model Run 0: Cross-Validation performance on training data:
0_Bagging: 0.7210807301060529
0_RandomForest: 0.7235192266070268
0_GradientBoost: 0.7066661857008874
0_AdaBoost: 0.6309140754635308
0_XGBoost: 0.8100497799581561
0_DecisionTree: 0.6982829521679532
0_LogisticRegression: 0.4927566553639709

Model Run 0: Performance on validation data:
0_Bagging: 0.7302158273381295
0_RandomForest: 0.7266187050359713
0_GradientBoost: 0.7230215827338129
0_AdaBoost: 0.6762589928057554
0_XGBoost: 0.8309352517985612
0_DecisionTree: 0.7050359712230215
0_LogisticRegression: 0.48201438848920863

CPU times: user 7min 50s, sys: 1.33 s, total: 7min 51s
Wall time: 9min 24s
# Plotting boxplots for CV scores for Run 0
fig = plt.figure(figsize=(15, 7))
fig.suptitle("Model Run 0: Algorithm Performance Comparison")
ax = fig.add_subplot(1, 1, 1)
plt.boxplot(results_0)
ax.set_xticklabels(names_0)
plt.xticks(rotation=45)
plt.show()
We trained the raw data with seven different algorithms. In the fitted results, neither AdaBoost nor Logistic Regression performed well, but the other models show promise, with validation recall at or above 70%.
Best Performing Model: XGBoost, on both cross-validation (0.810) and validation (0.831) recall.

Overall Performance:

Training (sorted from highest to lowest): XGBoost (0.810), RandomForest (0.724), Bagging (0.721), GradientBoost (0.707), DecisionTree (0.698), AdaBoost (0.631), LogisticRegression (0.493)

Validation (sorted from highest to lowest): XGBoost (0.831), Bagging (0.730), RandomForest (0.727), GradientBoost (0.723), DecisionTree (0.705), AdaBoost (0.676), LogisticRegression (0.482)
Checking for class imbalance next, we can see that our minority class is underrepresented. Ideally, the minority class should be at least 10% of the majority class.
# Check class imbalance for training, validation and test data
print('*********************************')
print('Target ratio: y_train')
print(y_train.value_counts(1))
print('*********************************')
print('Target ratio: y_validation')
print(y_validation.value_counts(1))
print('*********************************')
print('Target ratio: y_test')
print(y_test.value_counts(1))
*********************************
Target ratio: y_train
Target
0    0.944533
1    0.055467
Name: proportion, dtype: float64
*********************************
Target ratio: y_validation
Target
0    0.9444
1    0.0556
Name: proportion, dtype: float64
*********************************
Target ratio: y_test
Target
0    0.9436
1    0.0564
Name: proportion, dtype: float64
Since undersampling the majority class would discard quite a bit of information, we will first use oversampling to bring the minority class up to at least 1/10th of the majority class count.
SMOTE (Synthetic Minority Oversampling Technique) will help us accomplish this by generating synthetic values based on neighboring samples.
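SMOTE's core move is simple: pick a minority sample, pick one of its k nearest minority neighbors, and place a synthetic point at a random spot on the segment between them. An illustrative sketch of that interpolation step (not imblearn's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def smote_point(x, neighbor, rng):
    """Synthetic minority sample on the segment between x and one of its neighbors."""
    lam = rng.uniform(0.0, 1.0)  # random interpolation factor in [0, 1]
    return x + lam * (neighbor - x)

x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 4.0])
synthetic = smote_point(x, neighbor, rng)

# The synthetic point stays inside the bounding box of the two real samples
assert np.all(synthetic >= np.minimum(x, neighbor))
assert np.all(synthetic <= np.maximum(x, neighbor))
print(synthetic)
```

Because synthetic points lie between real minority samples, SMOTE adds plausible variation rather than exact duplicates.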
# Check the distribution of classes
class_distribution = Counter(y_train)
print(class_distribution)
Counter({0: 14168, 1: 832})
# Check unique class labels
unique_classes = np.unique(y_train)
print(f'Unique classes in y_train: {unique_classes}')
Unique classes in y_train: [0 1]
# Apply Synthetic Minority Over Sampling Technique to raise class balance to 10%
# Define the target ratio (sampling_strategy is minority/majority, not a share of the total)
desired_ratio = 0.1 # minority will be ~10% of the majority count; we are at ~6% now
# Initialize and apply SMOTE with the desired ratio
sm_train = SMOTE(sampling_strategy=desired_ratio
, k_neighbors=5
, random_state=1
)
# Apply SMOTE to resample the training data
X_train_over, y_train_over = sm_train.fit_resample(X_train, y_train)
# Check new class distribution
print('New class distribution:', Counter(y_train_over))
New class distribution: Counter({0: 14168, 1: 1416})
print("Pre-OverSampling, y_train = '1': {}".format(sum(y_train == 1)))
print("Pre-OverSampling, y_train = '0': {} \n".format(sum(y_train == 0)))
print("Post-OverSampling, y_train = '1': {}".format(sum(y_train_over == 1)))
print("Post-OverSampling, y_train = '0': {} \n".format(sum(y_train_over == 0)))
print_dataframe_shape('New X_train', X_train_over)
print_dataframe_shape('New y_train', y_train_over)
Pre-OverSampling, y_train = '1': 832
Pre-OverSampling, y_train = '0': 14168

Post-OverSampling, y_train = '1': 1416
Post-OverSampling, y_train = '0': 14168

Number of rows and columns for New X_train: (15584, 40)
Number of rows and columns for New y_train: (15584,)
Let's also try undersampling. Rather than stopping at a 10/1 ratio, the sampler below uses sampling_strategy=1 to fully balance the classes at 1:1.
RandomUnderSampler() accomplishes this by dropping majority-class samples, which amplifies the relative weight of the minority class.
# Apply undersampling
random_us = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_under, y_train_under = random_us.fit_resample(X_train, y_train)
# Check new class distribution
print('New class distribution:', Counter(y_train_under))
New class distribution: Counter({0: 832, 1: 832})
print("Pre-Undersampling, y_train = '1': {}".format(sum(y_train == 1)))
print("Pre-Undersampling, y_train = '0': {} \n".format(sum(y_train == 0)))
print("Post-Undersampling, y_train = '1': {}".format(sum(y_train_under == 1)))
print("Post-Undersampling, y_train = '0': {} \n".format(sum(y_train_under == 0)))
print_dataframe_shape('New X_train', X_train_under)
print_dataframe_shape('New y_train', y_train_under)
Pre-Undersampling, y_train = '1': 832
Pre-Undersampling, y_train = '0': 14168

Post-Undersampling, y_train = '1': 832
Post-Undersampling, y_train = '0': 832

Number of rows and columns for New X_train: (1664, 40)
Number of rows and columns for New y_train: (1664,)
Let's see how models trained on the oversampled and undersampled data perform. To do this, we'll repeat the steps from earlier.
# Build a second dictionary of untuned models for Run 1
model_1_dict = {'1_Bagging' : BaggingClassifier(random_state=1)
,'1_RandomForest' : RandomForestClassifier(random_state=1)
,'1_GradientBoost' : GradientBoostingClassifier(random_state=1)
,'1_AdaBoost' : AdaBoostClassifier(random_state=1)
,'1_XGBoost' : XGBClassifier(random_state=1, eval_metric='logloss')
,'1_DecisionTree' : DecisionTreeClassifier(random_state=1)
,'1_LogisticRegression' : LogisticRegression(random_state=1)
}
# Check new dictionary
model_1_dict
{'1_Bagging': BaggingClassifier(random_state=1),
'1_RandomForest': RandomForestClassifier(random_state=1),
'1_GradientBoost': GradientBoostingClassifier(random_state=1),
'1_AdaBoost': AdaBoostClassifier(random_state=1),
'1_XGBoost': XGBClassifier(eval_metric='logloss', random_state=1, ...),
'1_DecisionTree': DecisionTreeClassifier(random_state=1),
'1_LogisticRegression': LogisticRegression(random_state=1)}
%%time
# Run the models on the oversampled data (timed run)
# Create empty lists to store names and results for Run 1a
models_1a = model_1_dict.items()
names_1a = []
results_1a = []
print('\nRun 1a: Cross-Validation performance on oversampled training data:')
# Loop through the model list to compute mean cross-validation scores
for name, model in models_1a:
kfold = StratifiedKFold(
n_splits=5
, shuffle=True
, random_state=1
)
cv_result = cross_val_score(
estimator=model
, X=X_train_over
, y=y_train_over
, scoring = scorer
,cv=kfold
)
results_1a.append(cv_result)
names_1a.append(name)
print('{}: {}'.format(name, cv_result.mean()))
print('\nRun 1a: Performance on validation data:')
for name, model in models_1a:
model.fit(X_train_over, y_train_over)
scores = recall_score(y_validation, model.predict(X_validation))
print('{}: {}'.format(name, scores))
Run 1a: Cross-Validation performance on oversampled training data:
1_Bagging: 0.7937913701289008
1_RandomForest: 0.8248643806300701
1_GradientBoost: 0.7824963917782313
1_AdaBoost: 0.6786592345592992
1_XGBoost: 0.8778330761956902
1_DecisionTree: 0.7895685064450306
1_LogisticRegression: 0.5712885084357736

Run 1a: Performance on validation data:
1_Bagging: 0.7553956834532374
1_RandomForest: 0.7913669064748201
1_GradientBoost: 0.7913669064748201
1_AdaBoost: 0.6762589928057554
1_XGBoost: 0.8453237410071942
1_DecisionTree: 0.7482014388489209
1_LogisticRegression: 0.564748201438849

CPU times: user 7min 32s, sys: 1.41 s, total: 7min 33s
Wall time: 8min 32s
OBSERVATIONS

Best Performing Models on oversampled training data (sorted from highest to lowest): XGBoost (0.878), RandomForest (0.825), Bagging (0.794), DecisionTree (0.790), GradientBoost (0.782), AdaBoost (0.679), LogisticRegression (0.571)

Validation Data Best Performing Models (sorted from highest to lowest): XGBoost (0.845), RandomForest (0.791), GradientBoost (0.791), Bagging (0.755), DecisionTree (0.748), AdaBoost (0.676), LogisticRegression (0.565)
%%time
# Run the models again, this time using the undersampled data (timed run)
# Create empty lists to store names and results for Run 1b
models_1b = model_1_dict.items()
names_1b = []
results_1b = []
print('\nRun 1b: Cross-Validation performance on undersampled training data:')
# Loop through the model list to compute mean cross-validation scores
for name, model in models_1b:
kfold = StratifiedKFold(
n_splits=5
, shuffle=True
, random_state=1
)
cv_result = cross_val_score(
estimator=model
, X=X_train_under
, y=y_train_under
, scoring = scorer
,cv=kfold
)
results_1b.append(cv_result)
names_1b.append(name)
print('{}: {}'.format(name, cv_result.mean()))
print('\nRun 1b: Performance on validation data:')
for name, model in models_1b:
model.fit(X_train_under, y_train_under)
scores = recall_score(y_validation, model.predict(X_validation))
print('{}: {}'.format(name, scores))
Run 1b: Cross-Validation performance on undersampled training data:
1_Bagging: 0.8641945025611427
1_RandomForest: 0.9038669648654498
1_GradientBoost: 0.8990621167303946
1_AdaBoost: 0.8666113556020489
1_XGBoost: 0.9014717552846114
1_DecisionTree: 0.8617776495202367
1_LogisticRegression: 0.8726138085275232

Run 1b: Performance on validation data:
1_Bagging: 0.8705035971223022
1_RandomForest: 0.8920863309352518
1_GradientBoost: 0.8884892086330936
1_AdaBoost: 0.8489208633093526
1_XGBoost: 0.89568345323741
1_DecisionTree: 0.841726618705036
1_LogisticRegression: 0.8525179856115108

CPU times: user 39 s, sys: 283 ms, total: 39.3 s
Wall time: 36.8 s
OBSERVATIONS

Overall, undersampled data generally shows superior performance compared to oversampled data in both cross-validation and validation phases, which suggests that models trained on undersampled data generalize better.

Best Performance on Training Data: RandomForest (0.904)

Best Performance on Validation Data: XGBoost (0.896)

Best Performing Models on Training data (sorted from highest to lowest): RandomForest (0.904), XGBoost (0.901), GradientBoost (0.899), LogisticRegression (0.873), AdaBoost (0.867), Bagging (0.864), DecisionTree (0.862)

Best Performing Models on validation data (sorted from highest to lowest): XGBoost (0.896), RandomForest (0.892), GradientBoost (0.888), Bagging (0.871), LogisticRegression (0.853), AdaBoost (0.849), DecisionTree (0.842)
Through a baseline run and two follow-up runs on oversampled and undersampled data, we identified our candidate models. Training on undersampled data clearly produced the strongest validation recall, while oversampling improved each model's validation score over the baseline by only a few percentage points.
Top Performers

Normal Training Data: XGBoost (0.831 validation recall)

Oversampled Training Data: XGBoost (0.845 validation recall)

Undersampled Training Data: XGBoost (0.896 validation recall), with RandomForest (0.892) and GradientBoost (0.888) close behind
# Plotting boxplots for CV scores for Run 1a (oversampled data)
fig = plt.figure(figsize=(15, 7))
fig.suptitle('Model Run 1a: Algorithm Performance Comparison (Oversampled Data)')
ax = fig.add_subplot(1, 1, 1)
plt.boxplot(results_1a)
ax.set_xticklabels(names_1a)
plt.xticks(rotation=45)
plt.show()
# Plotting boxplots for CV scores for Run 1b (undersampled data)
fig = plt.figure(figsize=(15, 7))
fig.suptitle('Model Run 1b: Algorithm Performance Comparison (Undersampled Data)')
ax = fig.add_subplot(1, 1, 1)
plt.boxplot(results_1b)
ax.set_xticklabels(names_1b)
plt.xticks(rotation=45)
plt.show()
Based on our initial run with normal, oversampled and undersampled data, we found models with the most promise. We'll train three in particular using the undersampled data.
XGBoost: XGBoost consistently performs well across both training and validation datasets, making it a strong candidate for tuning.
Random Forest: RandomForest shows the highest cross-validation score on undersampled data and performs competitively on validation data.
Gradient Boost: GradientBoost has strong performance in cross-validation and validation phases, making it a viable model for tuning.
We will select these models based on their high performance metrics in both training and validation phases, specifically focusing on their ability to generalize well to unseen data.
Next, let's tune these models to help further with optimizing performance.
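Each tuning run below follows the same scaffolding: a parameter grid fed to `GridSearchCV` with the recall-oriented `scorer` defined earlier in the notebook. A minimal sketch of that pattern on synthetic data (assuming the scorer is `make_scorer(recall_score)`, which matches our focus on catching failures):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV

# Small imbalanced toy problem standing in for the undersampled training data
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=1)
scorer = make_scorer(recall_score)  # assumption: the notebook's scorer is recall-based

grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid={'n_estimators': [25, 50]},
    scoring=scorer,
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

`best_estimator_` then holds the refit model with the winning combination, which is what each `tuned_model_*` variable below captures.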
Let's tune the XGBoost model using the recommended hyperparameters provided with the project:
param_grid = {
'n_estimators': [150, 200, 250],
'scale_pos_weight': [5, 10],
'learning_rate': [0.1, 0.2],
'gamma': [0, 3, 5],
'subsample': [0.8, 0.9]
}
%%time
# Tuning XGBoost (timed run)
estimator_xgboost = XGBClassifier(random_state=1, eval_metric='logloss')
# Grid of recommended tuning parameters
parameter_grid_xgb = {
'n_estimators':np.arange(50,110,25), # Reduced due to modeling time
'scale_pos_weight': [5, 10],
'learning_rate': [0.1, 0.2],
'gamma': [0, 3, 5],
'subsample': [0.8, 0.9]
}
# Run the Grid CV search
grid_obj_xgboost = GridSearchCV(
estimator_xgboost,
parameter_grid_xgb,
scoring=scorer
)
# Fit the model to the data, using undersampled data
grid_obj_xgboost = grid_obj_xgboost.fit(X_train_under, y_train_under)
# Set the model to the best combination of parameters
tuned_model_xgboost = grid_obj_xgboost.best_estimator_
# Fit the model to the data using best estimator
tuned_model_xgboost.fit(X_train_under, y_train_under)
# Print out best parameters
print("Best parameters are {} with CV score={}:" .format(grid_obj_xgboost.best_params_, grid_obj_xgboost.best_score_))
Best parameters are {'gamma': 5, 'learning_rate': 0.1, 'n_estimators': 50, 'scale_pos_weight': 10, 'subsample': 0.8} with CV score=0.9338720150061324:
CPU times: user 6min 1s, sys: 1.36 s, total: 6min 2s
Wall time: 3min 53s
# Calculating metrics on training data
tuned_model_xgboost_train = model_performance_classification_sklearn(
tuned_model_xgboost, X_train, y_train
)
print('Training performance:')
tuned_model_xgboost_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.827333 | 1.0 | 0.243133 | 0.391161 |
# Calculating metrics on validation data
tuned_model_xgboost_val = model_performance_classification_sklearn(
tuned_model_xgboost, X_validation, y_validation
)
print('Validation performance:')
tuned_model_xgboost_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.8104 | 0.917266 | 0.216102 | 0.349794 |
# creating confusion matrix for predictions
confusion_matrix_sklearn(tuned_model_xgboost, X_validation, y_validation)
Let's tune the Random Forest model using the recommended hyperparameters provided with the project:
param_grid = {
"n_estimators": [200, 250, 300],
"min_samples_leaf": np.arange(1, 4),
"max_features": [np.arange(0.3, 0.6, 0.1), 'sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1)
}
%%time
# Tuning Random Forest (timed run)
estimator_random_forest = RandomForestClassifier(random_state=1)
# Grid of recommended tuning parameters
parameter_grid_rf = {
'n_estimators': np.arange(50, 110, 25), # Reduced due to modeling time (was a literal list [50,110,25] by mistake)
'min_samples_leaf': np.arange(1, 4),
'max_features': list(np.arange(0.3, 0.6, 0.1)) + ['sqrt'], # flattened so each candidate is a scalar
'max_samples': np.arange(0.4, 0.7, 0.1),
'class_weight': [{0: 0.33, 1: 0.66}]
}
# Run the grid search
grid_obj_random_forest = GridSearchCV(
estimator_random_forest,
parameter_grid_rf,
scoring=scorer
)
# Fit the model to the data
grid_obj_random_forest = grid_obj_random_forest.fit(X_train_under, y_train_under)
# Set the model to the best combination of parameters
tuned_model_random_forest = grid_obj_random_forest.best_estimator_
# Fit the model to the data using the best estimator
tuned_model_random_forest.fit(X_train_under, y_train_under)
print("Best parameters are {} with CV score={}:" .format(grid_obj_random_forest.best_params_, grid_obj_random_forest.best_score_))
Best parameters are {'class_weight': {0: 0.33, 1: 0.66}, 'max_features': 'sqrt', 'max_samples': 0.5, 'min_samples_leaf': 3, 'n_estimators': 50} with CV score=0.9038381069186926:
CPU times: user 51.7 s, sys: 133 ms, total: 51.9 s
Wall time: 52.2 s
# Calculating metrics on training data
tuned_model_random_forest_train = model_performance_classification_sklearn(
tuned_model_random_forest, X_train, y_train
)
print('Training performance:')
tuned_model_random_forest_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.922467 | 0.936298 | 0.412388 | 0.572584 |
# Calculating metrics on validation data
tuned_model_random_forest_val = model_performance_classification_sklearn(
tuned_model_random_forest, X_validation, y_validation
)
print('Validation performance:')
tuned_model_random_forest_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.9112 | 0.881295 | 0.373476 | 0.524625 |
# creating confusion matrix for predictions
confusion_matrix_sklearn(tuned_model_random_forest, X_validation, y_validation)
Let's perform hyperparameter tuning on the Gradient Boost model using the recommended parameters:
param_grid = {
'n_estimators': np.arange(100, 150, 25),
'learning_rate': [0.2, 0.05, 1],
'subsample': [0.5, 0.7],
'max_features': [0.5, 0.7]
}
%%time
# Tuning Gradient Boost (timed run)
estimator_gradient_boost = GradientBoostingClassifier(random_state=1)
# Grid of recommended tuning parameters
parameter_grid_gb = {
"n_estimators": np.arange(100, 150, 25),
"learning_rate": [0.2, 0.05, 1],
"subsample": [0.5, 0.7],
"max_features": [0.5, 0.7]
}
# Type of scoring used to compare parameter combinations - for this run we optimize F1
scorer_gb = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj_gradient_boost = GridSearchCV(
estimator_gradient_boost,
parameter_grid_gb,
scoring=scorer_gb
)
# Fit the model to the data
grid_obj_gradient_boost = grid_obj_gradient_boost.fit(X_train_under, y_train_under)
# Set the model to the best estimator parameters
tuned_model_gradient_boost = grid_obj_gradient_boost.best_estimator_
# Fit the tuned model to the data with best
tuned_model_gradient_boost.fit(X_train_under, y_train_under)
CPU times: user 2min 24s, sys: 154 ms, total: 2min 24s
Wall time: 2min 25s

GradientBoostingClassifier(learning_rate=0.2, max_features=0.5, n_estimators=125,
random_state=1, subsample=0.7)
# Calculating metrics on training data
tuned_model_gradient_boost_train = model_performance_classification_sklearn(
tuned_model_gradient_boost, X_train, y_train
)
print('Training performance:')
tuned_model_gradient_boost_train
Training performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.934467 | 0.977163 | 0.457513 | 0.623227 |
# Calculating metrics on validation data
tuned_model_gradient_boost_val = model_performance_classification_sklearn(
tuned_model_gradient_boost, X_validation, y_validation
)
print('Validation performance:')
tuned_model_gradient_boost_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.9226 | 0.884892 | 0.409318 | 0.559727 |
# creating confusion matrix for predictions
confusion_matrix_sklearn(tuned_model_gradient_boost, X_validation, y_validation)
Tuning these models on resampled data with the recommended hyperparameters greatly improved performance. Optimizing the models to minimize False Negatives is critical for this imbalanced dataset.
The tuned models demonstrate significantly higher recall than the untuned baselines, indicating improved sensitivity to the positive (failure) class.
Random Forest and Gradient Boost both held up well: Random Forest's tuned validation recall (88%) roughly matches its pre-tuning undersampled result (89%), while Gradient Boost improved from a 79% validation score on the oversampled run to 88% after tuning.
XGBoost: training recall 1.000, validation recall 0.917

Random Forest: training recall 0.936, validation recall 0.881

Gradient Boost: training recall 0.977, validation recall 0.885
Next, let's combine our results to create an ensemble model.
We can choose to either train this model on the same undersampled dataset, or use the normal training data. There are advantages to both approaches, but because we want the model to generalize well, let's first try it on the normal (non-undersampled) training data. We'll also try it on the undersampled data as well.
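StackingClassifier fits each base estimator on the training data, builds out-of-fold predictions for them via internal cross-validation, and trains the final estimator on those predictions as features. A toy-scale sketch of the construction (synthetic data; the notebook's version passes the three tuned models as base estimators and the tuned XGBoost as `final_estimator`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, random_state=1)

stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=25, random_state=1)),
        ('gb', GradientBoostingClassifier(n_estimators=25, random_state=1)),
    ],
    # LogisticRegression keeps the sketch light; the real ensemble uses tuned XGBoost here
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
print(round(stack.score(X, y), 3))
```

Because the meta-features are out-of-fold predictions, the final estimator learns how to weight the base models without simply memorizing their training fits.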
# Add the estimators for stacking our final model
estimators_ensemble = [('XG Boost', tuned_model_xgboost),
('Gradient Boost', tuned_model_gradient_boost),
('Random Forest', tuned_model_random_forest)]
# Construct the metamodel (final estimator) from the tuned XGBoost
estimator_ensemble = tuned_model_xgboost
# Construct the stacked models - we will do three, to handle normal, oversampled, and undersampled data
model_ensemble = StackingClassifier(estimators=estimators_ensemble, final_estimator=estimator_ensemble)
model_ensemble_under = StackingClassifier(estimators=estimators_ensemble, final_estimator=estimator_ensemble)
model_ensemble_over = StackingClassifier(estimators=estimators_ensemble, final_estimator=estimator_ensemble)
# Fit the model to the normal training data
model_ensemble.fit(X_train, y_train)
StackingClassifier(estimators=[('XG Boost',
XGBClassifier(eval_metric='logloss', gamma=5, learning_rate=0.1,
n_estimators=50, random_state=1, ...)),
('Gradient Boost',
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=125, random_state=1, subsample=0.7)),
('Random Forest',
RandomForestClassifier(class_weight={0: 0.33, 1: 0.66}, max_samples=0.5,
min_samples_leaf=3, n_estimators=50, random_state=1))],
final_estimator=XGBClassifier(eval_metric='logloss', gamma=5,
learning_rate=0.1, n_estimators=50, random_state=1, ...))
# Run final model to calculate metrics on normal training data
model_ensemble_train = model_performance_classification_sklearn(
model_ensemble, X_train, y_train
)
print('Training performance: Normal Training Data')
model_ensemble_train
Training performance: Normal Training Data
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.9914 | 0.967548 | 0.887541 | 0.925819 |
# Run final model to calculate metrics on validation data
model_ensemble_val = model_performance_classification_sklearn(
model_ensemble, X_validation, y_validation
)
print('Validation performance: ')
model_ensemble_val
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.9804 | 0.877698 | 0.792208 | 0.832765 |
# Create confusion matrix for predictions
confusion_matrix_sklearn(model_ensemble, X_validation, y_validation)
# Fit the model to the undersampled training data
model_ensemble_under.fit(X_train_under, y_train_under)
StackingClassifier(estimators=[('XG Boost',
XGBClassifier(eval_metric='logloss', gamma=5, learning_rate=0.1,
n_estimators=50, random_state=1, ...)),
('Gradient Boost',
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=125, random_state=1, subsample=0.7)),
('Random Forest',
RandomForestClassifier(class_weight={0: 0.33, 1: 0.66}, max_samples=0.5,
min_samples_leaf=3, n_estimators=50, random_state=1))],
final_estimator=XGBClassifier(eval_metric='logloss', gamma=5,
learning_rate=0.1, n_estimators=50, random_state=1, ...))
# Run final model to calculate metrics on undersampled training data
model_ensemble_train_under = model_performance_classification_sklearn(
model_ensemble_under, X_train_under, y_train_under
)
print('Training performance: Undersampled Training Data')
model_ensemble_train_under
Training performance: Undersampled Training Data
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.926082 | 0.990385 | 0.877529 | 0.930548 |
# Run final model to calculate metrics on validation data
model_ensemble_val_under = model_performance_classification_sklearn(
model_ensemble_under, X_validation, y_validation
)
print('Validation performance: ')
model_ensemble_val_under
Validation performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.7632 | 0.899281 | 0.177809 | 0.296912 |
# Create confusion matrix for predictions
confusion_matrix_sklearn(model_ensemble_under, X_validation, y_validation)
# Fit the model to the oversampled training data
model_ensemble_over.fit(X_train_over, y_train_over)
StackingClassifier(estimators=[('XG Boost',
XGBClassifier(eval_metric='logloss', gamma=5, learning_rate=0.1,
n_estimators=50, random_state=1, ...)),
('Gradient Boost',
GradientBoostingClassifier(learning_rate=0.2, max_features=0.5,
n_estimators=125, random_state=1, subsample=0.7)),
('Random Forest',
RandomForestClassifier(class_weight={0: 0.33, 1: 0.66}, max_samples=0.5,
min_samples_leaf=3, n_estimators=50, random_state=1))],
final_estimator=XGBClassifier(eval_metric='logloss', gamma=5,
learning_rate=0.1, n_estimators=50, random_state=1, ...))
# Run final model to calculate metrics on oversampled training data
model_ensemble_train_over = model_performance_classification_sklearn(
model_ensemble_over, X_train_over, y_train_over
)
print('Training performance: Oversampled Training Data')
model_ensemble_train_over
Training performance: Oversampled Training Data
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.926082 | 0.990385 | 0.877529 | 0.930548 |
# Run final model to calculate metrics on validation data
model_ensemble_val_over = model_performance_classification_sklearn(
model_ensemble_over, X_validation, y_validation
)
print('Validation performance: ')
model_ensemble_val_over
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.7632 | 0.899281 | 0.177809 | 0.296912 |
# Create confusion matrix for predictions
confusion_matrix_sklearn(model_ensemble_over, X_validation, y_validation)
Now that we are confident in our validation results, we are ready to evaluate the model against the unseen test data. Before we can do this, we must pre-process the data to ensure there are no missing values.
Once the data is ready, we'll run our models against the test dataset to see how well they generalize.
# Drop the target variable from the test dataset
X_test = df_test.drop(['Target'], axis=1)
y_test = df_test['Target']
#Check the size of the original test data
print_dataframe_shape('X_test', X_test)
print_dataframe_shape('y_test', y_test)
Number of rows and columns for X_test: (5000, 40) Number of rows and columns for y_test: (5000,)
# Use Imputation to fill in missing values (NaN)
imputer = SimpleImputer(strategy='median')
X_test = imputer.fit_transform(X_test)
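Note that the cell above fits the imputer on the test data itself. To avoid test-set statistics leaking into pre-processing, the imputer is usually fitted on the training data only and then applied to the test data. A minimal sketch with toy stand-in frames (the `_demo` names are hypothetical, not from the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for X_train / X_test (hypothetical values)
X_train_demo = pd.DataFrame({'V1': [1.0, np.nan, 3.0], 'V2': [4.0, 5.0, np.nan]})
X_test_demo = pd.DataFrame({'V1': [np.nan, 2.0], 'V2': [np.nan, 6.0]})

# Fit the imputer on the training data only, then apply it to the test data,
# so test-set statistics never influence the imputation values
imputer = SimpleImputer(strategy='median')
imputer.fit(X_train_demo)
X_test_imputed = imputer.transform(X_test_demo)  # fills NaN with training medians
```

With this pattern, the same training medians are reused unchanged for any future production batch.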
# Run final model to calculate metrics on test data - use normal ensemble model
model_ensemble_test = model_performance_classification_sklearn(
model_ensemble, X_test, y_test
)
print('Test performance: Model with Normal Data')
model_ensemble_test
Test performance: Model with Normal Data
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.9808 | 0.861702 | 0.81 | 0.835052 |
# Create confusion matrix for predictions
confusion_matrix_sklearn(model_ensemble, X_test, y_test)
# Run final model to calculate metrics on test data - use undersampled ensemble model
model_ensemble_test_under = model_performance_classification_sklearn(
model_ensemble_under, X_test, y_test
)
print('Test performance: Model with Undersampled Data')
model_ensemble_test_under
Test performance: Model with Undersampled Data
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.766 | 0.904255 | 0.182403 | 0.303571 |
# Create confusion matrix for predictions
confusion_matrix_sklearn(model_ensemble_under, X_test, y_test)
# Run final model to calculate metrics on test data - use oversampled ensemble model
model_ensemble_test_over = model_performance_classification_sklearn(
model_ensemble_over, X_test, y_test
)
print('Test performance: Model with Oversampled Data')
model_ensemble_test_over
Test performance: Model with Oversampled Data
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.974 | 0.868794 | 0.724852 | 0.790323 |
# Create confusion matrix for predictions
confusion_matrix_sklearn(model_ensemble_over, X_test, y_test)
Recall: The undersampled model yielded the highest Recall (0.9043), indicating it is best at identifying true positives and minimizing false negatives, though it suffers from low precision.
Accuracy: The normal data model achieves the highest accuracy (0.9808), suggesting it performs well overall, but with slightly lower recall than the undersampled model.
Balance: The oversampled model maintains a good balance between recall (0.8688) and precision (0.7249), making it a strong contender for practical applications.
The undersampled model excels in recall, although the normal and oversampled models provide better overall performance metrics. Since we care most about minimizing False Negatives, we will move forward with the undersampled dataset.
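The recall/precision trade-off above need not come only from resampling; it can also be steered by the probability threshold used to flag a failure. A hedged sketch on synthetic data (all names and numbers are illustrative, not from the notebook):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the sensor features
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Lowering the threshold flags more generators as potential failures:
# recall can only go up (fewer False Negatives), precision typically drops
recall_default = recall_score(y_te, proba >= 0.5)
recall_low = recall_score(y_te, proba >= 0.2)
```

Because lowering the threshold can only turn negatives into positives, recall is monotonically non-decreasing as the threshold falls, which makes this a simple knob for minimizing False Negatives.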
The following bar chart illustrates relative feature importance with regard to generator failure. These results are further summarized in our Insights.
# List out feature importance based on final XGBoost model (the stacked model classifiers do not support feature importance)
feature_names = X_train.columns
importances = tuned_model_xgboost.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='blue', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# Print these as values
importance_df = pd.DataFrame({
'Feature': feature_names,
'Importance': importances
})
# Sort by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)
importance_df
| Feature | Importance | |
|---|---|---|
| 35 | V36 | 0.111211 |
| 25 | V26 | 0.042192 |
| 17 | V18 | 0.038492 |
| 13 | V14 | 0.036565 |
| 10 | V11 | 0.035884 |
| 15 | V16 | 0.034713 |
| 14 | V15 | 0.034042 |
| 38 | V39 | 0.032091 |
| 31 | V32 | 0.027499 |
| 9 | V10 | 0.026261 |
| 20 | V21 | 0.025850 |
| 18 | V19 | 0.025181 |
| 11 | V12 | 0.024176 |
| 2 | V3 | 0.023683 |
| 34 | V35 | 0.023568 |
| 5 | V6 | 0.022950 |
| 26 | V27 | 0.022733 |
| 7 | V8 | 0.022354 |
| 36 | V37 | 0.021502 |
| 29 | V30 | 0.021211 |
| 23 | V24 | 0.021187 |
| 19 | V20 | 0.020541 |
| 30 | V31 | 0.020195 |
| 27 | V28 | 0.020072 |
| 1 | V2 | 0.018881 |
| 0 | V1 | 0.018875 |
| 12 | V13 | 0.018471 |
| 3 | V4 | 0.018413 |
| 4 | V5 | 0.017755 |
| 33 | V34 | 0.017538 |
| 22 | V23 | 0.016810 |
| 16 | V17 | 0.016804 |
| 39 | V40 | 0.016611 |
| 32 | V33 | 0.016368 |
| 8 | V9 | 0.015728 |
| 21 | V22 | 0.015293 |
| 28 | V29 | 0.014828 |
| 24 | V25 | 0.014611 |
| 37 | V38 | 0.014447 |
| 6 | V7 | 0.014415 |
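The quartile groupings used in the insights that follow can be derived from an importance table like the one above with `pd.qcut`. A minimal sketch on synthetic importances (values illustrative, not the notebook's):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the importance_df built above
rng = np.random.default_rng(1)
importance_df = pd.DataFrame({
    'Feature': [f'V{i}' for i in range(1, 41)],
    'Importance': rng.random(40),
}).sort_values(by='Importance', ascending=False)

# Bucket features into quartiles by importance (10 features per bucket for 40 features)
importance_df['Quartile'] = pd.qcut(
    importance_df['Importance'], q=4,
    labels=['Bottom 25%', 'Lower middle', 'Upper middle', 'Top 25%'],
)
top_features = importance_df.loc[importance_df['Quartile'] == 'Top 25%', 'Feature']
```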
Finally, let's construct a reusable pipeline to ensure our model can easily be redeployed into production.
# Import the libraries we need
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# Create pipeline with Stacking
pipeline_stacked = make_pipeline(
StandardScaler(), # Preprocessing step
model_ensemble_under # Stacked model: Best performing one (trained on undersampled data)
)
# Fit the pipeline to the training data
pipeline_stacked.fit(X_train_under, y_train_under)
[Fitted Pipeline repr truncated: ('standardscaler', StandardScaler()) followed by ('stackingclassifier', StackingClassifier(...)) with the same tuned XGBClassifier, GradientBoostingClassifier, and RandomForestClassifier estimators as above.]
# Show the pipeline steps
pipeline_stacked.steps
[('standardscaler', StandardScaler()),
 ('stackingclassifier', StackingClassifier(...))]
# Run final model to calculate metrics on training data
pipeline_stacked_train_under = model_performance_classification_sklearn(
pipeline_stacked, X_train_under, y_train_under
)
print('Training performance:')
pipeline_stacked_train_under
Training performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.926082 | 0.990385 | 0.877529 | 0.930548 |
# Run final model to calculate metrics on val data
pipeline_stacked_val = model_performance_classification_sklearn(
pipeline_stacked, X_validation, y_validation
)
print('Validation performance:')
pipeline_stacked_val
Validation performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.7632 | 0.899281 | 0.177809 | 0.296912 |
# Create confusion matrix for predictions
confusion_matrix_sklearn(pipeline_stacked, X_validation, y_validation)
# Run final model to calculate metrics on test data
pipeline_stacked_test = model_performance_classification_sklearn(
pipeline_stacked, X_test, y_test
)
print('Test performance:')
pipeline_stacked_test
Test performance:
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.766 | 0.904255 | 0.182403 | 0.303571 |
# Create confusion matrix for predictions - test data
confusion_matrix_sklearn(pipeline_stacked, X_test, y_test)
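To actually redeploy the fitted pipeline, it must be persisted to disk and reloaded by the serving process. A minimal sketch using `joblib` (a scikit-learn dependency) with a small stand-in pipeline; the file name is hypothetical:

```python
from pathlib import Path
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline (the real one wraps the stacked ensemble)
X, y = make_classification(n_samples=200, n_features=5, random_state=1)
pipe = make_pipeline(StandardScaler(),
                     RandomForestClassifier(n_estimators=10, random_state=1))
pipe.fit(X, y)

# Serialize to disk, then reload and predict, as a production service would
path = Path('pipeline_stacked.joblib')  # hypothetical file name
joblib.dump(pipe, path)
reloaded = joblib.load(path)
preds = reloaded.predict(X)
```

Because the scaler and classifier travel together in one artifact, the serving process cannot accidentally skip or mismatch the pre-processing step.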
The following section details our findings and recommendations to help ReneWind address generator failure prediction and monitoring.
For our final model, we chose a stacked ensemble model leveraging the following tuned models. In addition, we trained these models on the undersampled dataset to maximize Recall.
The three candidate models (trained on normal, undersampled, and oversampled data) were each evaluated against the test set. As with our initial raw models, the model trained on undersampled data gave superior Recall performance.
The final model erroneously predicted generator failure 23% of the time (False Positive rate). In these cases, proactive inspections and/or repairs will be conducted, potentially delaying an actual failure and associated replacement costs.
The final model correctly predicted failures only 5% of the time. However, since we are seeking to prevent missed failure predictions, this is not an important value to maximize for this model. Whether it was predicted or not, the replacement cost still applies.
The final model correctly predicted no failures 72% of the time, resulting in no costs or resource consumption.
The final model missed predicting generator failures less than 1% of the time (0.54% False Negative rate). While each missed failure incurs the high replacement cost, few failures should go undetected in production, and these costs should be fairly predictable.
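These outcome rates can be combined into a back-of-the-envelope expected maintenance cost per generator. The dollar figures below are purely illustrative assumptions; the source states only the ordering inspection < repair < replacement:

```python
# Illustrative unit costs -- hypothetical values; only the ordering
# inspection < repair < replacement comes from the problem statement
COST_INSPECTION = 1_000
COST_REPAIR = 5_000
COST_REPLACEMENT = 25_000

# Test-set outcome rates quoted above (fractions of all generators)
rate_fp = 0.23    # false positive: inspection finds nothing
rate_tp = 0.05    # true positive: inspected and repaired before failure
rate_tn = 0.72    # true negative: no action, no cost
rate_fn = 0.0054  # false negative: missed failure, full replacement

expected_cost_per_generator = (
    rate_fp * COST_INSPECTION
    + rate_tp * (COST_INSPECTION + COST_REPAIR)
    + rate_fn * COST_REPLACEMENT
)

# Baseline without the model: every failure ends in a replacement
no_model_cost = (rate_tp + rate_fn) * COST_REPLACEMENT
```

Under these assumed costs the model comes out well ahead of the no-model baseline; the business can substitute its real unit costs to size the actual savings.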
Variable Importance
While we are unable to offer specific insights due to the ciphered nature of the data, we can offer insights into the relative importance of each feature, based on our analysis.
Top 25% (Highest Quartile)
Middle 50% (Middle Quartiles)
Bottom 25% (Lowest Quartile)
Data
The business should add device identifiers to track data specifically against certain generators over their lifetime. If we find that we are correctly predicting multiple inspections and repairs for specific generators, replacement costs can more readily be planned.
If any of the top 25% of features can be monitored or forecasted, applying engineering effort toward these will yield the highest chance of preventing generator failures.
The model developed in the notebook generalized well on our test data and should be deployed into production.
After a measurable time in production (12 months minimum), we suggest re-running this analysis to assess actual production model performance. If the undersampled model does not generalize as well on new unseen production data, we recommend adjusting the pipeline to use the normal dataset instead.
!pip install nbconvert
# Create an HTML version of this notebook - do this last
%%shell
jupyter nbconvert --to html '/content/drive/MyDrive/Learning/Data Coursework/PGP-DSBA/6-Model Tuning/Project 6/MT_Project_LearnerNotebook_FullCode.ipynb'
[NbConvertApp] Converting notebook /content/drive/MyDrive/Learning/Data Coursework/PGP-DSBA/6-Model Tuning/Project 6/MT_Project_LearnerNotebook_FullCode.ipynb to html [NbConvertApp] Writing 3585942 bytes to /content/drive/MyDrive/Learning/Data Coursework/PGP-DSBA/6-Model Tuning/Project 6/MT_Project_LearnerNotebook_FullCode.html